Re: Exposing the lock manager's WaitForLockers() to SQL
I got some very helpful off-list feedback from Robert Haas that this needed more self-contained explanation/motivation. So here goes. :-)

This patch set adds a new SQL function pg_wait_for_lockers(), which waits for transactions holding specified table locks to commit or roll back. This can be useful with knowledge of the queries in those transactions, particularly for asynchronous and incremental processing of inserted/updated rows.

Specifically, consider a scenario where INSERTs and UPDATEs always set a serial column to its default value. A client can call pg_sequence_last_value() + pg_wait_for_lockers() and then take a new DB snapshot and know that rows committed after this snapshot will have values of the serial column greater than the value from pg_sequence_last_value(). As shown in the example at the end, this allows the client to asynchronously and incrementally read inserted/updated rows with minimal per-client state, without buffering changes, and without affecting writer transactions.

There are lots of other ways to support incrementally reading new rows, but they don't have all of those qualities. For example:

* Forcing writers to commit in a specific order (e.g. by serial column value) would reduce throughput
* Explicitly tracking or coordinating with writers would likely be more complex, impact performance, and/or require much more state
* Methods that are synchronous or buffer/queue changes are problematic if readers fall behind

Existing ways to wait for table locks also have downsides:

* Taking a conflicting lock with LOCK blocks new transactions from taking the lock of interest while LOCK waits. And in order to wait for writers holding RowExclusiveLock, we must take ShareLock, which also conflicts with ShareUpdateExclusiveLock and therefore unnecessarily interferes with (auto)vacuum. Finally, with multiple tables LOCK locks them one at a time, so it waits (and holds locks) longer than necessary.
* Using pg_locks / pg_lock_status() to identify the transactions holding the locks is more expensive since it also returns all other locks in the DB cluster, plus there's no efficient built-in way to wait for the transactions to commit or roll back.

By contrast, pg_wait_for_lockers() doesn't block other transactions, waits on multiple tables in parallel, and doesn't spend time looking at irrelevant locks.

This change is split into three patches for ease of review. The first two patches modify the existing WaitForLockers() C function and other locking internals to support waiting for lockers in a single lock mode, which allows waiting for INSERT/UPDATE without waiting for vacuuming. These changes could be omitted at the cost of unnecessary waiting, potentially for a long time with slow vacuums. The third patch adds the pg_wait_for_lockers() SQL function, which just calls WaitForLockers().

FWIW, another solution might be to directly expose the functions that WaitForLockers() calls, namely GetLockConflicts() (generalized to GetLockers() in the first patch) to identify the transactions holding the locks, and VirtualXactLock() to wait for each transaction to commit or roll back. That would be more complicated for the client but could be more broadly useful. I could investigate that further if it seems preferable.

=== Example ===

Assume we have the following table:

CREATE TABLE page_views (
    id bigserial,
    view_time timestamptz
);

which is only ever modified by (potentially concurrent) INSERT commands that assign the default value to the id column.
We can run the following commands:

SELECT pg_sequence_last_value('page_views_id_seq');
 pg_sequence_last_value
------------------------
                      4

SELECT pg_wait_for_lockers(array['page_views']::regclass[], 'RowExclusiveLock', FALSE);

Now we know that all rows where id <= 4 have been committed or rolled back, and we can observe/process them:

SELECT * FROM page_views WHERE id <= 4;
 id |         view_time
----+----------------------------
  2 | 2024-01-01 12:34:01.00-00
  3 | 2024-01-01 12:34:00.00-00

Later we can iterate:

SELECT pg_sequence_last_value('page_views_id_seq');
 pg_sequence_last_value
------------------------
                      9

SELECT pg_wait_for_lockers(array['page_views']::regclass[], 'RowExclusiveLock', FALSE);

We already observed all the rows where id <= 4, so this time we can filter them out:

SELECT * FROM page_views WHERE id > 4 AND id <= 9;
 id |         view_time
----+----------------------------
  5 | 2024-01-01 12:34:05.00-00
  8 | 2024-01-01 12:34:04.00-00
  9 | 2024-01-01 12:34:07.00-00

We can continue iterating like this to incrementally observe more newly inserted rows. Note that the only state we persist across iterations is the value returned by pg_sequence_last_value(). In this example, we processed inserted rows exactly once. Variations are possible for handling updates, as discussed in the original email, and I could explain that again if it would be helpful.
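To spell out the client-side loop the example implies, here is a minimal sketch using psql variables for the persisted state (the pg_wait_for_lockers() argument list is my reading of the patch as posted, not verbatim from it):

\set last_id 0   -- once, before the first iteration

-- each iteration:
SELECT pg_sequence_last_value('page_views_id_seq') AS v \gset new_
SELECT pg_wait_for_lockers(array['page_views']::regclass[], 'RowExclusiveLock', FALSE);
-- any statement after the wait runs under a new snapshot, so the delta is stable:
SELECT * FROM page_views WHERE id > :last_id AND id <= :new_v;
\set last_id :new_v   -- the only state carried to the next iteration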
Re: Exposing the lock manager's WaitForLockers() to SQL
I should add that the latest patches remove permissions checks because pg_locks doesn't have any, and improve the commit messages. Hope I didn't garble anything doing this late after the dev conference. :-) Robert asked me about other existing functions that could be leveraged, such as GetConflictingVirtualXIDs(), but I didn't see any with close-enough semantics that handle fast-path locks as needed for tables/relations.
Re: meson "experimental"?
Hi. > "Alternatively, PostgreSQL can be built using Meson. This is the only > option for building PostgreSQL in Windows using Visual Something[*]. > For other platforms, using Meson is currently experimental." +1 good catch > [*] What is the correct name for this? I believe in this section it should be "Visual Studio" as we specify elsewhere [1][2]. In [2] we name specific required versions. Maybe we should reference this section. While on it, in [2] section 17.7.5 is named "Visual". I don't think such a product exists and find it confusing. Maybe we should rename the section to "Visual Studio". Also I don't see any mention of the minimum required version of Ninja. I think we should add it too, or maybe reference the corresponding section I couldn't find in "17.1 Requirements" [1]: https://www.postgresql.org/docs/devel/install-meson.html [2]: https://www.postgresql.org/docs/devel/installation-platform-notes.html -- Best regards, Aleksander Alekseev
Re: meson "experimental"?
Hi,

> > [*] What is the correct name for this?
>
> I believe in this section it should be "Visual Studio" as we specify
> elsewhere [1][2]. In [2] we name specific required versions. Maybe we
> should reference this section.
>
> While on it, in [2] section 17.7.5 is named "Visual". I don't think
> such a product exists and find it confusing. Maybe we should rename
> the section to "Visual Studio".

Here are the corresponding patches.

> Also I don't see any mention of the minimum required version of Ninja.
> I think we should add it too, or maybe reference the corresponding
> section I couldn't find in "17.1 Requirements"
>
> [1]: https://www.postgresql.org/docs/devel/install-meson.html
> [2]: https://www.postgresql.org/docs/devel/installation-platform-notes.html

From a quick look at the buildfarm we seem to use Ninja >= 1.11.1. However, since Meson can use both Ninja and VS as a backend, I'm not certain which section would be most appropriate for naming the minimum required version of Ninja.

--
Best regards,
Aleksander Alekseev

v1-0001-Meson-is-not-experimental-on-Windows.patch Description: Binary data
v1-0002-Rename-section-17.7.5-to-Visual-Studio.patch Description: Binary data
Re: Implementing CustomScan over an index
Hi,

> So I started implementing a CustomScan. It's not trivial.
>
> I've learned that the system executes ExecInitCustomScan automatically, but I
> probably need it to do most of the stuff in ExecInitIndexScan, and then
> execute the scan mostly the way it's done in IndexNext.
>
> Basically, I want just a normal index scan, but with the ability to do custom
> things with the quals and the projection.
>
> So... what's the best approach?
>
> Is there any sample code that does this? A search of github doesn't turn up
> much.
>
> Is there any way to do this without duplicating everything in nodeIndexscan.c
> myself?

Yes, unfortunately it is not quite trivial. There is a "Writing a Custom Scan Provider" chapter in the documentation that may help [1]. TimescaleDB uses CustomScans; maybe using its code as an example will help [2]. Hint: exploring `git log` is often helpful too.

If something in the documentation is not clear, maybe it can be improved. Let us know here or (even better) provide a patch. If you have a particular piece of code that doesn't do what you want, try uploading a minimal example on GitHub and asking here.

From a quick look I couldn't find an example of implementing a CustomScan in ./contrib/ or ./src/test/. If you can think of a usage example of CustomScans, consider contributing a test case.

[1]: https://www.postgresql.org/docs/current/custom-scan.html
[2]: https://github.com/timescale/timescaledb/

--
Best regards,
Aleksander Alekseev
The xversion-upgrade test fails to stop server
Hello Andrew,

While reviewing recent buildfarm failures, I came across this one:
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=crake&dt=2024-05-23%2004%3A11%3A03

upgrade.crake/REL_16_STABLE/REL9_5_STABLE-ctl4.log
waiting for server to shut down... failed
pg_ctl: server does not shut down

Looking at:
https://github.com/PGBuildFarm/client-code/blob/05014d50e/PGBuild/Modules/TestUpgradeXversion.pm#L641

I see that ctl4.log is created after updating extensions, and REL9_5_STABLE-update_extensions.log contains:

You are now connected to database "contrib_regression_redis_fdw" as user "buildfarm".
ALTER EXTENSION "hstore" UPDATE;
ALTER EXTENSION
You are now connected to database "contrib_regression_btree_gin" as user "buildfarm".
ALTER EXTENSION "btree_gin" UPDATE;
ALTER EXTENSION
...

but I see no corresponding server log file containing these commands in the failure log.

When running the same test locally, I find these in inst/upgrade_log.

Maybe uploading this log file too would help in understanding the cause of the failure...

Best regards,
Alexander
Re: meson "experimental"?
On Thu, May 30, 2024 at 6:32 AM Aleksander Alekseev <aleksan...@timescale.com> wrote:
>
> By a quick look on the buildfarm we seem to use Ninja >= 1.11.1.
> However since Meson can use both Ninja and VS as a backend I'm not
> certain which section would be most appropriate for naming the minimal
> required version of Ninja.
>

When I tried building with the VS backend it blew up; I don't recall the details. I think we should just use ninja everywhere. That keeps things simple. On Windows I just install python and then do "pip install meson ninja".

cheers

andrew
Re: Add LSN <-> time conversion functionality
Hi everyone!

Me, Bharath, and Ilya are at a patch review session at PGConf.dev :) Maybe we got everything wrong, so please consider that we are just doing training on reviewing patches.

=== Purpose of the patch ===

Currently, we have checkpoint_timeout and max_wal_size to know when we need a checkpoint. This patch brings a capability to freeze pages not only based on the internal state of the system, but also by wall clock time. To do so we need infrastructure that tells us when a page was modified. The patch in this thread is doing exactly this: in-memory information to map LSNs to wall clock time. The mapping is maintained by the background writer.

=== Questions ===

1. The patch does not handle server restart. Will all pages need freezing after any crash?
2. Some benchmarks to prove the patch does not add a CPU footprint.

=== Nits ===

The "timeline" term is already taken.
The patch needs a rebase due to some header changes.
Tests fail on Windows.
The patch lacks tests.
Some docs would be nice, but the feature is for developers.
The mapping is protected for concurrent access by the walstats LWLock and might have tuplestore_putvalues() called under that lock. That might be a little dangerous if the tuplestore ends up on disk for some reason (which should not happen).

Overall, the patch is a base for a good feature which would help to freeze pages at the right time. Thanks!

Best regards,
Bharath, Andrey, Ilya.
Re: The xversion-upgrade test fails to stop server
Sent from my iPhone

> On May 30, 2024, at 8:00 AM, Alexander Lakhin wrote:
>
> Hello Andrew,
>
> While reviewing recent buildfarm failures, I came across this one:
> https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=crake&dt=2024-05-23%2004%3A11%3A03
>
> upgrade.crake/REL_16_STABLE/REL9_5_STABLE-ctl4.log
> waiting for server to shut down... failed
> pg_ctl: server does not shut down
>
> Looking at:
> https://github.com/PGBuildFarm/client-code/blob/05014d50e/PGBuild/Modules/TestUpgradeXversion.pm#L641
>
> I see that ctl4.log is created after updating extensions and
> REL9_5_STABLE-update_extensions.log contains:
> You are now connected to database "contrib_regression_redis_fdw" as user "buildfarm".
> ALTER EXTENSION "hstore" UPDATE;
> ALTER EXTENSION
> You are now connected to database "contrib_regression_btree_gin" as user "buildfarm".
> ALTER EXTENSION "btree_gin" UPDATE;
> ALTER EXTENSION
> ...
> but I see no corresponding server log file containing these commands in the failure log.
>
> When running the same test locally, I find these in inst/upgrade_log.
>
> Maybe uploading this log file too would help to understand what is the cause of the failure...

Will investigate after I return from pgconf.

Cheers

Andrew
Vacuum statistics
Hello, everyone!

I think we don't have enough information to analyze vacuum functionality. Needless to say, vacuum is the most important process for a database system: it prevents problems like table and index bloat and does emergency freezing if we have a wraparound problem. Furthermore, it keeps the visibility map up to date. On the other hand, incorrectly adjusted, overly aggressive autovacuum settings can make it consume a lot of computing resources, which causes all queries in the system to run longer.

Nowadays vacuum gathers statistical information about tables, but that information is important not only for the optimizer. Because vacuum is an automated process, there are a lot of settings that determine how aggressively it operates on the objects of the database. Besides, sometimes it is important to set a parameter correctly for a specific table because of how dynamically it changes. A database administrator needs to tune autovacuum to strike a balance between the vacuum's useful work in the database system on the one hand and the overhead of its workload on the other. However, it is not enough to judge vacuum behavior from statistical information about the number of vacuum passes over tables and the operational data from progress_vacuum, because the latter is available only while a vacuum is running and does not provide a strategic overview over the considered period.

To sum up, autovacuum has strategic behavior, because the frequency of its runs and its resource consumption depend on the workload of the database. Its load on the database is minimal for an append-only table and maximal for a table with high-frequency updates. Furthermore, vacuum load depends strongly on the number and size of indexes. Because there is no visibility map for indexes, vacuum scans each index completely, and the worst situation is when it has to do that for a bloated index on a small table.

I suggest gathering information about vacuum resource consumption when processing indexes and tables and storing it per table and per index (for example, in the PgStat_StatTabEntry structure, as is done for the usual statistics). It will allow us to determine how well vacuum is configured and to evaluate its overhead on the system at the strategic level. Vacuum already gathers this information, but this valuable information is not stored anywhere.

--
Regards,
Alena Rybakina
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company
Re: [PATCH] psql: \dn+ to show size of each schema (and \dA+ for AMs)
Hi Justin,

Thanks for the patch and the work on it. In reviewing the basic feature, I think this is something that has utility and is worthwhile at the high level. A few more specific notes:

The pg_namespace_size() function can stand on its own, and probably has some utility for the released Postgres versions. I do think the psql implementation for the \dn+ or \dA+ commands shouldn't need to use this same function; it's a straightforward expansion of the SQL query that can be run in a way that will be backwards-compatible with any connected postgres version, so no reason to exclude this information for these cases. (This may have been in an earlier revision of the patchset; I didn't check everything.)

I think the \dX++ command versions add code complexity without a real need for it. We have precedent with \l(+) to show permissions on the basic display and size on the extended display, and I think this is sufficient in this case here. While moving the permissions to \dn is a behavior change, it's adding information, not taking it away, and as an interactive command it is unlikely to introduce significant breakage in any scripting scenario. (In reviewing the patch we've also seen a bit of odd behavior / a possible bug with the existing extended + commands, which introducing significant ++ overloading might make more confusing, but that's not the fault/concern of this patch.)

Quickie summary:

0001-Add-pg_am_size-pg_namespace_size.patch - fine, but needs rebase to work
0002-psql-add-convenience-commands-dA-and-dn.patch - split into just + variant; remove \l++; make \dn show permissions and \dn+ show size
0003-f-convert-the-other-verbose-to-int-too.patch - unneeded
0004-Move-the-double-plus-Size-columns-to-the-right.patch - unneeded

Docs on the first patch seemed fine; I do think we'll need docs changes for the psql changes.

Best,

David
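For reference, the kind of backwards-compatible query I have in mind for the psql side (my own sketch, not taken from the patch) would be along these lines; it only uses functions that exist in all supported server versions:

SELECT n.nspname AS schema,
       pg_size_pretty(sum(pg_total_relation_size(c.oid))) AS size
FROM pg_class c
JOIN pg_namespace n ON n.oid = c.relnamespace
WHERE c.relkind IN ('r', 'm')   -- plain tables and matviews; indexes and TOAST
                                -- are already counted by pg_total_relation_size()
GROUP BY n.nspname
ORDER BY sum(pg_total_relation_size(c.oid)) DESC;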
Re: Vacuum statistics
On 30.05.2024 10:33, Alena Rybakina wrote:
> I suggest gathering information about vacuum resource consumption for
> processing indexes and tables and storing it in the table and index
> relationships (for example, PgStat_StatTabEntry structure like it has
> realized for usual statistics). It will allow us to determine how well
> the vacuum is configured and evaluate the effect of overhead on the
> system at the strategic level, the vacuum has gathered this information
> already, but this valuable information doesn't store it.

My colleagues and I have prepared a patch that can help to solve this problem. We are open to feedback.

--
Regards,
Alena Rybakina
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company

From dfda656b35be2a73c076cb723fdeb917630e61e3 Mon Sep 17 00:00:00 2001
From: Alena Rybakina
Date: Thu, 30 May 2024 11:02:10 -0700
Subject: [PATCH] Machinery for grabbing extended vacuum statistics on relations.

Remember, statistics on heap and index relations are a bit different.

The values of total_blks_hit, total_blks_read and total_blks_dirtied are the numbers of pages hit, missed and dirtied in shared buffers during a vacuum operation, respectively. total_blks_dirtied means 'dirtied only by this action': if a page was already dirty before the vacuum operation, it is not counted as 'dirtied'.

The tuples_deleted parameter is the number of tuples cleaned up by the vacuum operation.

The delay_time value is the total vacuum sleep time at the vacuum delay point.

The pages_removed value is the number of pages by which the physical data storage of the relation was reduced. The pages_deleted parameter is the number of freed pages in the table (the file size may not have changed).

The interruptions counter is the number of interruptions of the (auto)vacuum process during vacuuming of a relation. We report from the vacuum_error_callback routine, so we can log all ERROR reports; in the case of autovacuum we can report SIGINT signals too. It may be dangerous to do such a complex task (sending) in an error callback - we could hit an ERROR-inside-ERROR problem - but the chance of getting stuck in this situation looks small, so let's try it. This parameter relates to a problem covered by b19e4250.

Tracking of I/O during an (auto)vacuum operation: the introduced variables blk_read_time and blk_write_time track only access to buffer pages and flushing them to disk. The reading operation is trivial, but the writing measurement technique is not obvious: during a vacuum, the write time can remain zero because no flushing operations were performed.

System time and user time are parameters that describe how much time a vacuum operation has spent executing code in user space and kernel space, respectively. We also accumulate the total time of a vacuum, which is the difference between timestamps at the start and finish points in the vacuum code. Remember the idle time, when vacuum waited for I/O and locks; the total time is therefore not equal to the sum of user and system time, but is no less than it.

pages_frozen - the number of pages that are marked as frozen in the VM during vacuum. This parameter is incremented if a page is marked as all-frozen.
pages_all_visible - the number of pages that are marked as all-visible in the VM during vacuum.
Authors: Alena Rybakina, Andrei Lepikhov, Andrei Zubkov
---
 src/backend/access/heap/vacuumlazy.c                       | 245 +-
 src/backend/access/heap/visibilitymap.c                    |  13 +
 src/backend/catalog/system_views.sql                       | 123 +
 src/backend/commands/vacuum.c                              |   4 +
 src/backend/commands/vacuumparallel.c                      |   1 +
 src/backend/utils/activity/pgstat.c                        | 117 +++--
 src/backend/utils/activity/pgstat_database.c               |   1 +
 src/backend/utils/activity/pgstat_relation.c               |  52 ++-
 src/backend/utils/adt/pgstatfuncs.c                        | 290
 src/backend/utils/error/elog.c                             |  13 +
 src/include/catalog/pg_proc.dat                            |  28 +-
 src/include/commands/vacuum.h                              |   1 +
 src/include/pgstat.h                                       | 118 -
 src/include/utils/elog.h                                   |   2 +-
 src/include/utils/pgstat_internal.h                        |  36 +-
 .../expected/vacuum-extended-statistic.out                 | 419 ++
 .../isolation/expected/vacuum-extending.out                |  68 +++
 src/test/isolation/isolation_schedule                      |   2 +
 .../specs/vacuum-extended-statistic.spec                   | 179
 .../isolation/specs/vacuum-extending.spec                  |  58 +++
 src/test/regress/expected/opr_sanity.out                   |   9 +-
 src/test/regress/expected/rules.out                        |  79
 22 files changed, 1812 insertions(+), 46 deletions(-)
 create mode 100644 src/test/isolation/expected/vacuum-extended-statistic.out
 create mode 100644 src/test/isolation/expected/vacuum-extending.out
 create mode 100644 src/test/isolation/specs/vacuum-extended-statistic.spec
 create mode 100644 src/test/isolation/specs/vacuum-extending.spec

diff --git a/src/backend/access/heap/vacuumlazy.c b/src/backend/access/heap/vacuumlazy.c
Re: Vacuum statistics
Hi,

On Thu, 30/05/2024 at 10:33 -0700, Alena Rybakina wrote:
> I suggest gathering information about vacuum resource consumption for
> processing indexes and tables and storing it in the table and index
> relationships (for example, PgStat_StatTabEntry structure like it has
> realized for usual statistics). It will allow us to determine how well
> the vacuum is configured and evaluate the effect of overhead on the
> system at the strategic level, the vacuum has gathered this information
> already, but this valuable information doesn't store it.

It seems this was a little bit unclear, so let me explain the point of the proposition a little.

As the vacuum process is a backend, it has workload instrumentation. We have all the basic counters available, such as the number of blocks read, hit and written, time spent on I/O, WAL stats and so on. Also, we can easily get some statistics specific to vacuum activity, i.e. the number of tuples removed, the number of blocks removed, the number of VM marks set and, of course, the most important metric - the time spent on the vacuum operation.

All those statistics must be stored by the Cumulative Statistics System on a per-relation basis, I mean individual cumulative counters for every table and every index in the database. Such counters will provide us a clear view of the vacuum workload on individual objects of the database, providing the means to measure the efficiency of performed vacuum fine tuning.

--
Andrei Zubkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company
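To illustrate how such per-relation counters could be consumed, here is a sketch of the kind of monitoring query an administrator might run; the view name and exact column set below are hypothetical, chosen only for the example (the patch may expose them differently):

-- Hypothetical view and column names, for illustration only.
SELECT relname,
       total_blks_read,
       total_blks_hit,
       total_blks_dirtied,
       tuples_deleted,
       delay_time
FROM pg_stat_vacuum_tables
ORDER BY total_blks_read DESC
LIMIT 10;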
Re: Add memory context type to pg_backend_memory_contexts view
Hi David,

Giving this a once-through, this seems straightforward and useful. I have a slight preference for keeping "name" the first field in the view and moving "type" to the second, but that's minor.

Just confirming that the allocator types are not extensible without a recompile, since it's using a specific node tag to switch on, so there are no concerns with not properly displaying the output of something else.

The "???" text placeholder might be more appropriate as something more descriptive, or perhaps stronger, include a WARNING in the logs, since an unknown tag at this point would be an indication of some sort of memory corruption.

Since there are only four possible values, I think there would be utility in including them in the docs for this field.

I also think it would be useful to have some sort of comments at least in mmgr/README to indicate that if a new type of allocator is introduced, you will also need to add the node to the function for this type, since it's not an automatic conversion. (For that matter, instead of switching on node type and outputting a given string, is there a generic function that could just give us the string value for node type so we don't need to teach anything else about it anyway?)

Thanks,

David
Re: Should consider materializing the cheapest inner path in consider_parallel_nestloop()
On Tue, 2024-04-23 at 16:59 +0800, Tender Wang wrote:
> [ cut ]
>
> I have rebased master and fixed a plan diff case.

We (me, Paul Jungwirth, and Yuki Fujii) reviewed this patch at the PGConf.dev Patch Review Workshop. Here are our findings.

The patch tries to allow using materialization together with parallel subqueries. It applies cleanly on 8fea1bd5411b793697a4c9087c403887e050c4ac (current HEAD). Tests pass locally on macOS and Linux in a VM under Windows. Tests are also green in cfbot (for the last 2 weeks; they were red previously, probably because of the need to rebase).

Please add more tests. Especially please add some negative tests; the current patch checks that it is safe to apply materialization. It would be helpful to add tests checking that materialization is not applied in both checked cases (see the sketch after this message):
1. when the inner join path is not parallel safe
2. when matpath is not parallel safe

This patch tries to apply materialization only when the join type is not JOIN_UNIQUE_INNER. The comment mentions this, but does not explain why. So please either add a comment describing the reason for that or try enabling materialization in such a case.

Best regards.

--
Tomasz Rybak, Debian Developer
GPG: A565 CE64 F866 A258 4DDC F9C7 ECB7 3E37 E887 AA8C
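To make the testing suggestion more concrete, here is the kind of setup one might adapt into a regression test. The table names and planner settings are ours, for illustration only; whether Materialize is actually chosen depends on costing, so treat this as a sketch rather than a ready-made test case:

SET parallel_setup_cost = 0;
SET parallel_tuple_cost = 0;
SET min_parallel_table_scan_size = 0;
SET enable_hashjoin = off;
SET enable_mergejoin = off;

CREATE TABLE outer_tbl (a int, b int);
CREATE TABLE inner_tbl (a int, b int);
INSERT INTO outer_tbl SELECT i, i FROM generate_series(1, 10000) i;
INSERT INTO inner_tbl SELECT i, i FROM generate_series(1, 10000) i;
ANALYZE outer_tbl, inner_tbl;

-- With the patch, look for a Materialize node on the inner side of the
-- Nested Loop under Gather; a negative test would make the inner path
-- parallel unsafe and check that Materialize is absent.
EXPLAIN (COSTS OFF)
SELECT * FROM outer_tbl o JOIN inner_tbl i ON o.a = i.a;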
Re: Reuse child_relids in try_partitionwise_join was Re: Assert failure on bms_equal(child_joinrel->relids, child_joinrelids)
Hello,

I am looking through some outstanding CommitFest entries; I wonder if this particular patch is already effectively fixed by 5278d0a2, which is both attributed to the original author and takes an extremely similar approach.

Can this entry (https://commitfest.postgresql.org/48/4553/) be closed?

Best,

David
Re: POC: GROUP BY optimization
Hi!

On Thu, May 30, 2024 at 7:22 AM Andrei Lepikhov wrote:
> On 5/29/24 19:53, Alexander Korotkov wrote:
> > Thank you for your feedback.
> >
> > On Wed, May 29, 2024 at 11:08 AM Andrei Lepikhov wrote:
> >> On 5/27/24 19:41, Alexander Korotkov wrote:
> >>> Any thoughts?
> >> About 0001:
> >> Having overviewed it, I don't see any issues (but I'm the author),
> >> except grammatical ones - but I'm not a native to judge it.
> >> Also, the sentence 'turning GROUP BY clauses into pathkeys' is unclear
> >> to me. It may be better to write something like: 'building pathkeys by
> >> the list of grouping clauses'.
> >
> > OK, thank you. I'll run once again for the grammar issues.

I've revised some grammar including the sentence you've proposed.

> >> 0002:
> >> The part under USE_ASSERT_CHECKING looks good to me. But the code in
> >> group_keys_reorder_by_pathkeys looks suspicious: of course, we do some
> >> doubtful work without any possible way to reproduce, but if we envision
> >> some duplicated elements in the group_clauses, we should avoid usage of
> >> the list_concat_unique_ptr.
> >
> > As I understand Tom, there is a risk that clauses list may contain
> > multiple instances of equivalent SortGroupClause, not duplicate
> > pointers.
> Maybe. I just implemented the worst-case scenario with the intention of
> maximum safety. So, it's up to you.
>
> >> What's more, why do you not exit from
> >> foreach_ptr immediately after SortGroupClause has been found? I think
> >> the new_group_clauses should be consistent with the new_group_pathkeys.
> >
> > I wanted this to be consistent with preprocess_groupclause(), where
> > duplicate SortGroupClause'es are grouped together. Otherwise, we
> > could just delete redundant SortGroupClause'es.
> Hm, preprocess_groupclause is called before the standard_qp_callback
> forms the 'canonical form' of group_pathkeys and such behaviour needed.
> But the code that chooses the reordering strategy uses already processed
> grouping clauses, where we don't have duplicates in the first
> num_groupby_pathkeys elements, do we?

Yep, it seems that we work with group clauses which were processed by standard_qp_callback(). In turn, standard_qp_callback() called make_pathkeys_for_sortclauses_extended() with remove_redundant == true. So, there shouldn't be duplicates, and it's unclear what we should be protected from. I left 0002 with just asserts, and extracted the questionable changes into 0005.

--
Regards,
Alexander Korotkov
Supabase

v3-0001-Fix-asymmetry-in-setting-EquivalenceClass.ec_sort.patch Description: Binary data
v3-0002-Add-invariants-check-to-get_useful_group_keys_ord.patch Description: Binary data
v3-0003-Rename-PathKeyInfo-to-GroupByOrdering.patch Description: Binary data
v3-0004-Restore-preprocess_groupclause.patch Description: Binary data
v3-0005-Teach-group_keys_reorder_by_pathkeys-about-redund.patch Description: Binary data
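As a purely SQL-level illustration of where redundant grouping clauses come from (not taken from the patches, just an example of what preprocess_groupclause() and make_pathkeys_for_sortclauses_extended() already collapse before the reordering code runs):

CREATE TABLE grp_demo (a int, b int, c numeric);
INSERT INTO grp_demo SELECT i % 10, i % 100, i FROM generate_series(1, 1000) i;
ANALYZE grp_demo;

-- The duplicate "a" is removed during planning; the resulting plan
-- shows only "Group Key: a, b".
EXPLAIN (COSTS OFF)
SELECT a, b, sum(c) FROM grp_demo GROUP BY a, b, a;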
Re: Add memory context type to pg_backend_memory_contexts view
On Fri, 31 May 2024 at 07:21, David Christensen wrote:
> Giving this a once-through, this seems straightforward and useful. I
> have a slight preference for keeping "name" the first field in the
> view and moving "type" to the second, but that's minor.

Not having it first makes sense, but I don't think putting it between name and ident is a good idea. I think name and ident belong next to each other. parent likely should come after those since that also relates to the name. How about putting it after "parent"?

> Just confirming that the allocator types are not extensible without a
> recompile, since it's using a specific node tag to switch on, so there
> are no concerns with not properly displaying the output of something
> else.

They're not extensible.

> The "???" text placeholder might be more appropriate as something more
> descriptive, or perhaps stronger, include a WARNING in the logs, since
> an unknown tag at this point would be an indication of some sort of
> memory corruption.

This follows what we do in other places. If you look at explain.c, you'll see lots of "???"s. I think if you're worried about corrupted memory, then garbled output in pg_get_backend_memory_contexts wouldn't be that high on the list of concerns.

> Since there are only four possible values, I think there would be
> utility in including them in the docs for this field.

I'm not sure about this. We do try not to expose too much internal detail in the docs. I don't know all the reasons for that, but at least one reason is that it's easy for things to get outdated as code evolves. I'm also unsure how much value there is in listing 4 possible values unless we were to document the meaning of each of those values, and doing that puts us even further down the path of detailing Postgres internals in the documents. I don't think that's a maintenance burden that's often worth the trouble.

> I also think it would be useful to have some sort of comments at least
> in mmgr/README to indicate that if a new type of allocator is introduced
> that you will also need to add the node to the function for this type,
> since it's not an automatic conversion.

I don't think it's sustainable to do this. If we have to maintain documentation that lists everything you must do in order to add some new node types then I feel it's just going to get outdated as soon as someone adds something new that needs to be done. I'm only one developer, but for me, I'd not even bother looking there if I was planning to add a new memory context type. What I would be doing is searching the entire code base for where special handling is done for the existing types and ensuring I consider if I need to include a case for the new node type. In this case, I'd probably choose to search for "T_AllocSetContext", and I'd quickly land on PutMemoryContextsStatsTupleStore() and update it. This method doesn't get outdated, provided you do "git pull" occasionally.

> (For that matter, instead of switching on node type and outputting a
> given string, is there a generic function that could just give us the
> string value for node type so we don't need to teach anything else
> about it anyway?)

There isn't. nodeToString() does take some node types as inputs and serialises those to something JSON-like, but that includes serialising each field of the node type too. The memory context types are not handled by those functions.

I think it's fine to copy what's done in explain.c. "git grep \"???\" -- *.c | wc -l" gives me 31 occurrences, so I'm not doing anything new.
I've attached an updated patch which changes the position of the new column in the view. Thank you for the review. David v3-0001-Add-context-type-field-to-pg_backend_memory_conte.patch Description: Binary data
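For anyone wanting to eyeball the result, a quick query against the patched view (assuming the new column is named "type" and placed after "parent", per the attached version) would look something like:

-- Assumes the patched pg_backend_memory_contexts view with a "type" column.
SELECT name, parent, type, total_bytes
FROM pg_backend_memory_contexts
ORDER BY total_bytes DESC
LIMIT 5;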
Explicit specification of index ensuring uniqueness of foreign columns
I'd like to resurrect a subset of my proposal in [1], specifically that:

The FOREIGN KEY constraint syntax gains a [ USING INDEX index_name ] clause optionally following the referenced column list. The index specified by this clause is used to support the foreign key constraint, and it must be a non-deferrable unique or primary key index on the referenced table compatible with the referenced columns.

I believe that it may be independently valuable to have some syntax available to influence which index is used to ensure uniqueness of the foreign columns in a foreign key constraint. Currently, this index is identified implicitly from the REFERENCEd columns when the constraint is created. This causes the following to imperfectly round trip through a pg_dump and restore:

CREATE TABLE foo (
    id INTEGER NOT NULL GENERATED BY DEFAULT AS IDENTITY
);
CREATE UNIQUE INDEX foo_key2 ON foo(id);
CREATE UNIQUE INDEX foo_key1 ON foo(id);
CREATE TABLE bar (
    foo_id INTEGER NOT NULL CONSTRAINT bar_fkey REFERENCES foo(id)
);

Using this query to identify the unique index backing the bar_fkey constraint:

SELECT objid, refobjid::regclass FROM pg_depend
WHERE objid = (
    SELECT oid FROM pg_constraint WHERE conname = 'bar_fkey'
) AND refobjsubid = 0;

Then after the DDL is applied, the foreign key constraint depends on foo_key2:

 objid | refobjid
-------+----------
 17152 | foo_key2

But following a pg_dump and restore, the foreign key's unique index dependency has changed to foo_key1:

 objid | refobjid
-------+----------
 17167 | foo_key1

This discrepancy appears to be caused by this confluence of circumstances:

1. The unique index backing the foreign side of a foreign key constraint is searched for in OID order:

static Oid
transformFkeyCheckAttrs(Relation pkrel,
                        int numattrs, int16 *attnums,
                        Oid *opclasses) /* output parameter */
{
    ...
    indexoidlist = RelationGetIndexList(pkrel);

    foreach(indexoidscan, indexoidlist)
    {
        ...
    }

2. The indexes appear in the pg_dump output before the FOREIGN KEY constraint, and they appear in lexicographic, rather than OID, order.

While, in this minimal reproduction, the two indexes are interchangeable, there are situations that may reasonably occur in the course of ordinary use in which they aren't. For example, a redundant unique index with different storage parameters may exist during the migration of an application schema. If the incorrect index is then selected to be a dependency of a foreign key constraint following a pg_dump and restore, it will likely cause subsequent steps in the migration to fail.

Note that this proposal deals with indexes rather than constraints because this is, internally, what PostgreSQL uses. Specifically, PostgreSQL doesn't actually require there to be a unique constraint on the foreign columns of a foreign key constraint; a unique index alone is sufficient. However, I believe that this proposal would be essentially the same if it were changed to a USING CONSTRAINT clause, since it is already possible to explicitly specify the underlying index for a unique or primary key constraint.

If I submitted a patch implementing this functionality, would there be any interest in it?

[1]: https://www.postgresql.org/message-id/flat/CA%2BCLzG8HZUk8Gb9BKN88fgdSEqHx%3D2iq5aDdvbz7JqSFjA2WxA%40mail.gmail.com
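For concreteness, with the proposed (not yet implemented) clause, the example above could pin the constraint to a specific index like this:

CREATE TABLE bar (
    foo_id INTEGER NOT NULL
        CONSTRAINT bar_fkey REFERENCES foo (id) USING INDEX foo_key2
);

pg_dump could then emit the same clause, so the dependency would survive a dump and restore regardless of index creation or lexicographic order.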
Cluster forcefully removal without user input
Hello,

I have installed PostgreSQL 15 and PostgreSQL 14 side by side and want to upgrade from 14 to 15. For upgrading purposes, I am using {postgresql-15-setup check_upgrade}. However, in my case, the installed 14 version is not compatible with the latest 15.7.

After the installation and cluster initialization of PostgreSQL 14 and 15, when I run the following command {postgresql-15-setup check_upgrade}, it returns the following message:

"Performing upgrade check: Upgrade failed. Removing the new cluster. Please re-initdb the new cluster.
failed"

After the failure, the PostgreSQL 15 cluster is removed forcefully due to the following code written in the postgresql-15-setup script file:

if [ $script_result -eq 0 ]; then
    echo $"OK"
else
    # Clean up after failure
    echo "Upgrade failed. Removing the new cluster. Please re-initdb the new cluster."
    rm -rf "$PGDATA"
    echo $"failed"
fi

My concern here is whether forcefully deleting the user's cluster without obtaining permission from the user is the right approach.

Regards,
Zaid Shabbir
AGEDB
Re: Use generation memory context for tuplestore.c
On Sat, 4 May 2024 at 03:51, Matthias van de Meent wrote:
>
> On Fri, 3 May 2024 at 15:55, David Rowley wrote:
> > master @ 8f0a97dff
> > Storage: Memory Maximum Storage: 16577kB
> >
> > patched:
> > Storage: Memory Maximum Storage: 8577kB
>
> Those are some impressive numbers.

This patch needed to be rebased, so updated patches are attached.

I was also reflecting on what Bruce wrote in [1] about having to parse performance numbers from the commit messages, so I decided to adjust the placeholder commit message I'd written to make the performance numbers more clear to Bruce, or whoever does the next major version release notes. That caused me to experiment with finding the best case for this patch. I could scale the improvement much further than I have, but here's something I came up with that's easy to reproduce.

create table winagg (a int, b text);
insert into winagg select a, repeat('a', 1024) from generate_series(1, 10000000) a;
set work_mem = '1MB';
set jit = 0;
explain (analyze, timing off)
select sum(l1), sum(l2)
from (select length(b) l1, length(lag(b, 800) over ()) as l2
      from winagg
      limit 1600);

master:  Execution Time: 6585.685 ms
patched: Execution Time: 4.159 ms

1583x faster.

I've effectively just exploited the spool_tuples() behaviour of what it does when the tuplestore goes to disk to have it spool the entire remainder of the partition, which is 10 million rows. I'm just taking a tiny portion of those with the LIMIT 1600. I just set work_mem to something where the patched version won't have the tuplestore spill to disk, so that spool_tuples() only spools what's needed in the patched version. So, artificial is a word you could use, but certainly, someone could find this performance cliff in the wild and be prevented from falling off it by this patch.

David

[1] https://www.postgresql.org/message-id/Zk5r2XyI0BhXLF8h%40momjian.us

v2-0001-Add-memory-disk-usage-for-Material-in-EXPLAIN-ANA.patch Description: Binary data
v2-0002-Optimize-memory-management-and-increase-performan.patch Description: Binary data