Re: a wrong index choose when statistics is out of date
On 5/3/2024 19:56, Andy Fan wrote: I think it is OK for a design review, for the implementaion side, the known issue includes: 1. Support grap such infromation from its parent for partitioned table if the child doesn't have such information. 2. builtin document and testing. Any feedback is welcome. Thanks for your efforts. I was confused when you showed the problem connected to clauses like "Var op Const" and "Var op Param". As far as I know, the estimation logic of such clauses uses MCV and number-distinct statistics. So, being out of MCV values, it becomes totally insensitive to any internal skew in data and any data outside the statistics boundaries. Having studied the example you provided with the patch, I think it is not a correct example: Difference between var_eq_const and var_eq_non_const quite obvious: In the second routine, you don't have information about the const value and can't use MCV for estimation. Also, you can't exclude MCV values from the estimation. And it is just luck that you've got the right answer. I think if you increased the weight of the unknown part, you would get a bad result, too. I would like to ask David why the var_eq_const estimator doesn't have an option for estimation with a histogram. Having that would relieve a problem with skewed data. Detecting the situation with incoming const that is out of the covered area would allow us to fall back to ndistinct estimation or something else. At least, histogram usage can be restricted by the reltuples value and ratio between the total number of MCV values and the total number of distinct values in the table. Just for demo: demonstration of data skew issue: CREATE EXTENSION tablefunc; CREATE TABLE norm_test AS SELECT abs(r::integer) AS val FROM normal_rand(1E7::integer, 5.::float8, 300.::float8) AS r; ANALYZE norm_test; -- First query is estimated with MCV quite precisely: EXPLAIN ANALYZE SELECT * FROM norm_test WHERE val = 100; -- result: planned rows=25669, actual rows=25139 -- Here we have numdistinct estimation, mostly arbitrary: EXPLAIN ANALYZE SELECT * FROM norm_test WHERE val = 200; -- result: planned rows=8604, actual rows=21239 EXPLAIN ANALYZE SELECT * FROM norm_test WHERE val = 500; -- result: planned rows=8604, actual rows=6748 EXPLAIN ANALYZE SELECT * FROM norm_test WHERE val = 600; -- result: planned rows=8604, actual rows=3501 EXPLAIN ANALYZE SELECT * FROM norm_test WHERE val = 700; -- result: planned rows=8604, actual rows=1705 EXPLAIN ANALYZE SELECT * FROM norm_test WHERE val = 1000; -- result: planned rows=8604, actual rows=91 -- regards, Andrei Lepikhov Postgres Professional
Re: [PoC] Improve dead tuple storage for lazy vacuum
On Thu, Mar 7, 2024 at 1:49 PM Masahiko Sawada wrote: > odr-violation seems to refer to One Definition Rule (ODR). According > to Wikipedia[1]: > > The One Definition Rule (ODR) is an important rule of the C++ > programming language that prescribes that classes/structs and > non-inline functions cannot have more than one definition in the > entire program and template and types cannot have more than one > definition by translation unit. It is defined in the ISO C++ Standard > (ISO/IEC 14882) 2003, at section 3.2. Some other programming languages > have similar but differently defined rules towards the same objective. > > I don't fully understand this concept yet but are these two different > build failures related? I thought it may have something to do with the prerequisite commit that moved some symbols from bitmapset.c to .h: /* Select appropriate bit-twiddling functions for bitmap word size */ #if BITS_PER_BITMAPWORD == 32 #define bmw_leftmost_one_pos(w) pg_leftmost_one_pos32(w) #define bmw_rightmost_one_pos(w) pg_rightmost_one_pos32(w) #define bmw_popcount(w) pg_popcount32(w) #elif BITS_PER_BITMAPWORD == 64 #define bmw_leftmost_one_pos(w) pg_leftmost_one_pos64(w) #define bmw_rightmost_one_pos(w) pg_rightmost_one_pos64(w) #define bmw_popcount(w) pg_popcount64(w) #else #error "invalid BITS_PER_BITMAPWORD" #endif ...but olingo's error seems strange to me, because it is complaining of pg_leftmost_one_pos, which refers to the lookup table in pg_bitutils.c -- I thought all buildfarm members used the bitscan instructions. grassquit is complaining of pg_popcount64, which is a global function, also in pg_bitutils.c. Not sure what to make of this, since we're just pointing symbols at things which should have a single definition...
Re: Stack overflow issue
Hi, Egor! On Thu, Mar 7, 2024 at 9:53 AM Egor Chindyaskin wrote: > > > 6 march 2024 г., at 19:17, Alexander Korotkov wrote: > > > > The revised set of remaining patches is attached. > > > > 0001 Turn tail recursion into iteration in CommitTransactionCommand() > > I did minor revision of comments and code blocks order to improve the > > readability. > > > > 0002 Avoid stack overflow in ShowTransactionStateRec() > > I didn't notice any issues, leave this piece as is. > > > > 0003 Avoid recursion in MemoryContext functions > > I've renamed MemoryContextTraverse() => MemoryContextTraverseNext(), > > which I think is a bit more intuitive. Also I fixed > > MemoryContextMemConsumed(), which was still trying to use the removed > > argument "print" of MemoryContextStatsInternal() function. > > > > Generally, I think this patchset fixes important stack overflow holes. > > It is quite straightforward, clear and the code has a good shape. I'm > > going to push this if no objections. > > I have tested the scripts from message [1]. After applying these patches and > Tom Lane’s patch from message [2], all of the above mentioned functions no > longer caused the server to crash. I also tried increasing the values in the > presented scripts, which also did not lead to server crashes. Thank you! > Also, I would like to clarify something. Will fixes from message [3] and > others be backported to all other branches, not just the master branch? As > far as I remember, Tom Lane made corrections to all branches. For example [4]. > > Links: > 1. > https://www.postgresql.org/message-id/343ff14f-3060-4f88-9cc6-efdb390185df%40mail.ru > 2. https://www.postgresql.org/message-id/386032.1709765547%40sss.pgh.pa.us > 3. > https://www.postgresql.org/message-id/CAPpHfduZqAjF%2B7rDRP-RGNHjOXy7nvFROQ0MGS436f8FPY5DpQ%40mail.gmail.com > 4. https://git.postgresql.org/gitweb/?p=postgresql.git;a=commit;h=e07ebd4b Thank you for your feedback! Initially I didn't intend to backpatch any of these. But on second thought with the references you provided, I think we should backpatch simple check_stack_depth() checks from d57b7cc333 to all supported branches, but apply refactoring of memory contextes and transaction commit/abort just to master. Opinions? -- Regards, Alexander Korotkov
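For readers not familiar with these checks: the simple fixes mentioned above follow the usual backend pattern of calling check_stack_depth() at the top of every recursive function, so that excessive recursion raises a clean ERROR instead of crashing the process. A minimal sketch of the pattern (the node type and walker below are invented for illustration, not the actual patched functions):

#include "postgres.h"
#include "miscadmin.h"		/* check_stack_depth() */
#include "nodes/pg_list.h"

typedef struct DemoNode
{
	List	   *children;	/* list of DemoNode * */
} DemoNode;

static void
walk_demo_tree(DemoNode *node)
{
	ListCell   *lc;

	/* error out (limit set by max_stack_depth) before the stack overflows */
	check_stack_depth();

	foreach(lc, node->children)
		walk_demo_tree((DemoNode *) lfirst(lc));
}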
Re: speed up a logical replica setup
On Thu, 7 Mar 2024 at 10:05, Euler Taveira wrote: > > On Wed, Mar 6, 2024, at 7:02 AM, Hayato Kuroda (Fujitsu) wrote: > > Thanks for updating the patch! > > > Thanks for the feedback. I'm attaching v26 that addresses most of your > comments > and some issues pointed by Vignesh [1]. Few comments: 1) We are disconnecting database again in error case too, it will lead to a double free in this scenario, + PQclear(res); + + disconnect_database(conn, false); + + if (max_repslots < num_dbs) + { + pg_log_error("subscriber requires %d replication slots, but only %d remain", +num_dbs, max_repslots); + pg_log_error_hint("Consider increasing max_replication_slots to at least %d.", + num_dbs); + disconnect_database(conn, true); + } + + if (max_lrworkers < num_dbs) + { + pg_log_error("subscriber requires %d logical replication workers, but only %d remain", +num_dbs, max_lrworkers); + pg_log_error_hint("Consider increasing max_logical_replication_workers to at least %d.", + num_dbs); + disconnect_database(conn, true); + } pg_createsubscriber: error: subscriber requires 5 logical replication workers, but only 4 remain pg_createsubscriber: hint: Consider increasing max_logical_replication_workers to at least 5. free(): double free detected in tcache 2 Aborted 2) We can also check that the primary is not using synchronous_standby_names, else all the transactions in the primary will wait infinitely once the standby server is stopped, this could be added in the documentation: +/* + * Is the primary server ready for logical replication? + */ +static void +check_publisher(struct LogicalRepInfo *dbinfo) +{ + PGconn *conn; + PGresult *res; + + char *wal_level; + int max_repslots; + int cur_repslots; + int max_walsenders; + int cur_walsenders; + + pg_log_info("checking settings on publisher"); + + conn = connect_database(dbinfo[0].pubconninfo, true); + + /* +* If the primary server is in recovery (i.e. cascading replication), +* objects (publication) cannot be created because it is read only. +*/ + if (server_is_in_recovery(conn)) + { + pg_log_error("primary server cannot be in recovery"); + disconnect_database(conn, true); + } 3) This check is present only for publication, we do not have this in case of creating a subscription. We can keep both of them similar, i.e. have the check in both or don't have the check for both publication and subscription: + /* Check if the publication already exists */ + appendPQExpBuffer(str, + "SELECT 1 FROM pg_catalog.pg_publication " + "WHERE pubname = '%s'", + dbinfo->pubname); + res = PQexec(conn, str->data); + if (PQresultStatus(res) != PGRES_TUPLES_OK) + { + pg_log_error("could not obtain publication information: %s", +PQresultErrorMessage(res)); + disconnect_database(conn, true); + } + 4) Few options are missing: + + + pg_createsubscriber + option + + + -d + --database + +dbname + + -D + --pgdata + +datadir + + -P + --publisher-server + +connstr + + + 4a) -n, --dry-run 4b) -p, --subscriber-port 4c) -r, --retain 4d) -s, --socket-directory 4e) -t, --recovery-timeout 4f) -U, --subscriber-username 5) typo connnections should be connections + +The port number on which the target server is listening for connections. +Defaults to running the target server on port 50432 to avoid unintended +client connnections. 6) repliation should be replication + * Create the subscriptions, adjust the initial location for logical + * replication and enable the subscriptions. That's the last step for logical + * repliation setup. 
7) I did not notice these changes in the latest patch: diff --git a/src/tools/pgindent/typedefs.list b/src/tools/pgindent/typedefs.list index d808aad8b0..08de2bf4e6 100644 --- a/src/tools/pgindent/typedefs.list +++ b/src/tools/pgindent/typedefs.list @@ -517,6 +517,7 @@ CreateSeqStmt CreateStatsStmt CreateStmt CreateStmtContext +CreateSubscriberOptions CreateSubscriptionStmt CreateTableAsStmt CreateTableSpaceStmt @@ -1505,6 +1506,7 @@ LogicalRepBeginData LogicalRepCommitData LogicalRepCommitPreparedTxnData LogicalRepCtxStruct +LogicalRepInfo LogicalRepMsgType LogicalRepPartMapEntry LogicalRepPreparedTxnData Regards, Vignes
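Regarding comment 1 above (the double free): one way to restructure that error path so that disconnect_database() runs exactly once is sketched below. This is only an illustration of the shape being suggested, reusing the names from the quoted hunk and assuming the second argument of disconnect_database() means "exit after disconnecting", as the quoted code suggests; it is not the actual patch:

	bool	failed = false;

	PQclear(res);

	if (max_repslots < num_dbs)
	{
		pg_log_error("subscriber requires %d replication slots, but only %d remain",
					 num_dbs, max_repslots);
		pg_log_error_hint("Consider increasing max_replication_slots to at least %d.",
						  num_dbs);
		failed = true;
	}

	if (max_lrworkers < num_dbs)
	{
		pg_log_error("subscriber requires %d logical replication workers, but only %d remain",
					 num_dbs, max_lrworkers);
		pg_log_error_hint("Consider increasing max_logical_replication_workers to at least %d.",
						  num_dbs);
		failed = true;
	}

	/* disconnect exactly once; exit with failure if a limit was too low */
	disconnect_database(conn, failed);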
Re: [PoC] Improve dead tuple storage for lazy vacuum
On Thu, Mar 7, 2024 at 1:19 PM John Naylor wrote: > > In addition, olingo and grassquit are showing different kinds of > "AddressSanitizer: odr-violation" errors, which I'm not sure what to > make of -- example: This might be relevant: $ git grep 'link_with: pgport_srv' src/test/modules/test_radixtree/meson.build: link_with: pgport_srv, No other test module uses this directive, and indeed, removing this still builds fine for me. Thoughts?
Re: 035_standby_logical_decoding unbounded hang
> On 20 Feb 2024, at 04:09, Noah Misch wrote: > I’m not sure if it is connected, but so far many patches in CFbot keep hanging in this test. For example [0]. I haven’t done root cause analysis yet, but hangs may be related to this thread. Maybe someone more familiar with similar issues could take a look there? Thanks! Best regards, Andrey Borodin. [0] https://cirrus-ci.com/task/5911176840740864?logs=check_world#L292
Re: [PoC] Improve dead tuple storage for lazy vacuum
On Thu, Mar 7, 2024 at 6:37 PM John Naylor wrote: > > On Thu, Mar 7, 2024 at 1:19 PM John Naylor wrote: > > > > In addition, olingo and grassquit are showing different kinds of > > "AddressSanitizer: odr-violation" errors, which I'm not sure what to > > make of -- example: > > This might be relevant: > > $ git grep 'link_with: pgport_srv' > src/test/modules/test_radixtree/meson.build: link_with: pgport_srv, > > No other test module uses this directive, and indeed, removing this > still builds fine for me. Thoughts? Yeah, it could be the culprit. test_radixtree/meson.build is the only test module that explicitly specifies a link with pgport_srv. I think we can get rid of it, as I've also confirmed that the build is still fine even without it. Regards, -- Masahiko Sawada Amazon Web Services: https://aws.amazon.com
Re: "type with xxxx does not exist" when doing ExecMemoize()
On Thu, 7 Mar 2024 at 15:24, Tender Wang wrote: > > Andrei Lepikhov 于2024年3月6日周三 11:37写道: >> I think, it is a bug. Should it be fixed (and back-patched) earlier? > > Agreed. Need David to review it as he knows this area best. This is on my list of things to do. Just not at the top yet. David
Re: Function and Procedure with same signature?
Hi Tom On Sat, Feb 10, 2024 at 12:38 AM Tom Lane wrote: > > "David G. Johnston" writes: > > On Fri, Feb 9, 2024, 12:05 Deepak M wrote: > >> Folks, When tried to create a function with the same signature as > >> procedure it fails. > > > That seems like a good hint you cannot do it. Specifically because they > > get defined in the same internal catalog within which names must be unique. The fact that we currently enforce it this way seems like an implementation deficiency that should be fixed. Bringing this up again, as there seems to be no good reason to have this restriction as there is no place in the call syntax where it would be unclear if a FUNCTION or a PROCEDURE is used. And at least Oracle does allow this (and possibly others) so having this unnecessary restriction in PostgreSQL makes migrations from Oracle applications where this feature is used needlessly complicated. > Worth noting perhaps that this is actually required by the SQL > standard: per spec, functions and procedures are both "routines" > and share the same namespace, Can you point me to a place in the standard where it requires all kinds of ROUTINES to be using the same namespace ? All I could find was that ROUTINES are either FUNCTIONS, PROCEDURES or METHODS and then samples of their usage which made clear that all three are different and usage is disjoint at syntax level. As for DROP ROUTINE we could just raise an error and recommend using more specific DROP FUNCTION or DROP PROCEDURE if there is ambiguity. -- Best Regards Hannu
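For anyone who wants to see the restriction Hannu is describing, a minimal reproduction (object names made up, error text approximate):

CREATE FUNCTION get_item(id int) RETURNS text
    LANGUAGE sql
    AS $$ SELECT 'item #' || id $$;

-- Fails today, because functions and procedures share pg_proc and its
-- unique (schema, name, argument types) constraint:
CREATE PROCEDURE get_item(id int)
    LANGUAGE plpgsql
    AS $$ BEGIN NULL; END $$;
-- ERROR:  function "get_item" already exists with same argument types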
Re: a wrong index choose when statistics is out of date
On Wed, 6 Mar 2024 at 02:09, Andy Fan wrote: > This patch introduces a new attoptions like this: > > ALTER TABLE t ALTER COLUMN col set (force_generic=true); > > Then selfunc.c realizes this and ignore the special Const value, then > average selectivity is chosen. This fall into the weakness of generic > plan, but this patch doesn't introduce any new weakness and we leave the > decision to user which could resolve some problem. Also this logic only > apply to eqsel since the ineqsel have the get_actual_variable_range > mechanism which is helpful for index choose case at least. If you don't want the planner to use the statistics for the column why not just do the following? ALTER TABLE t ALTER COLUMN col SET STATISTICS 0; ANALYZE won't delete any existing statistics, so that might need to be done manually. David
Re: [EXTERNAL] Re: Add non-blocking version of PQcancel
Attached is a new patchset with various changes. I created a dedicated 0002 patch to add tests for the already existing cancellation functions, because that seemed useful for another thread where changes to the cancellation protocol are being proposed[1]. [1]: https://www.postgresql.org/message-id/flat/508d0505-8b7a-4864-a681-e7e5edfe32aa%40iki.fi On Wed, 6 Mar 2024 at 19:22, Alvaro Herrera wrote: > > Docs: one bogus "that that". This was already fixed by my previous doc changes in v32; I guess that email got crossed with this one. > Did we consider having PQcancelConn() instead be called > PQcancelCreate()? Done > "Asynchronously cancel a query on the given connection. This requires > polling the returned PGcancelConn to actually complete the cancellation > of the query." but this is no longer a good description of what this > function does. Fixed > Anyway, maybe there are reasons for this; but in any case we > should set ->cancelRequest in all cases, not only after the first tests > for errors. Done > I think the extra PGconn inside pg_cancel_conn is useless; it would be > simpler to typedef PGcancelConn to PGconn in fe-cancel.c, and remove the > indirection through the extra struct. You're actually dereferencing the > object in two ways in the new code, both by casting the outer object > straight to PGconn (taking advantage that the struct member is first in > the struct), and by using PGcancelConn->conn. This seems pointless. I > mean, if we're going to cast to "PGconn *" in some places anyway, then > we may as well access all members directly. Perhaps, if you want, you > could add asserts that ->cancelRequest is set true in all the > fe-cancel.c functions. Anyway, we'd still have compiler support to tell > you that you're passing the wrong struct to the function. (I didn't > actually try to change the code this way, so I might be wrong.) Turns out you were wrong about the compiler support to tell us we're passing the wrong struct: when both the PGconn and PGcancelConn typedefs refer to the same struct, the compiler allows passing PGconn to PGcancelConn functions and vice versa without complaining. This seems enough reason for me to keep the indirection through the extra struct. So instead of adding the proposed typedef, I chose to add a comment to pg_cancel_conn explaining its purpose, as well as not casting PGcancelConn to PGconn but always accessing the conn field for consistency. > We could move the definition of struct pg_cancel to fe-cancel.c. Nobody > outside that needs to know that definition anyway. Done in 0003 v33-0003-libpq-Move-pg_cancel-to-fe-cancel.c.patch Description: Binary data v33-0002-Add-tests-for-libpq-query-cancellation-APIs.patch Description: Binary data v33-0001-Add-missing-connection-statuses-to-docs.patch Description: Binary data v33-0005-Start-using-new-libpq-cancel-APIs.patch Description: Binary data v33-0004-libpq-Add-encrypted-and-non-blocking-versions-of.patch Description: Binary data
Re: bug report: some issues about pg_15_stable(8fa4a1ac61189efffb8b851ee77e1bc87360c445)
On Tue, 5 Mar 2024 at 13:55, Dean Rasheed wrote: > > > If I only execute merge , I will get the following error: > > merge into tgt a using src1 c on a.a = c.a when matched then update > > set b = c.b when not matched then insert (a,b) values(c.a,c.b); -- excute > > fail > > ERROR: MERGE command cannot affect row a second time > > HIINT: Ensure that not more than one source row matches any one target > > row. > > > > But when I execute the update and merge concurrently, I will get the > > following result set. > > Yes, this should still produce that error, even after a concurrent update. > > My initial reaction is that neither of those blocks of code is > entirely correct, and that they should both be doing both of those > checks. I.e., something like the attached (which probably needs some > additional test cases). > OK, I've pushed and back-patched that fix for this issue, after adding some tests (nice catch, by the way!). That wasn't related to the original issue though, so the problem with UNION ALL still remains to be fixed. The patch from [1] looks promising (for HEAD at least), but it really needs more pairs of eyes on it (bearing in mind that it's just a rough proof-of-concept patch at this stage). [1] https://www.postgresql.org/message-id/CAEZATCVa-mgPuOdgd8%2BYVgOJ4wgJuhT2mJznbj_tmsGAP8hHJw%40mail.gmail.com Regards, Dean
Re: a wrong index choose when statistics is out of date
On Thu, 7 Mar 2024 at 21:17, Andrei Lepikhov wrote: > I would like to ask David why the var_eq_const estimator doesn't have an > option for estimation with a histogram. Having that would relieve a > problem with skewed data. Detecting the situation with incoming const > that is out of the covered area would allow us to fall back to ndistinct > estimation or something else. At least, histogram usage can be > restricted by the reltuples value and ratio between the total number of > MCV values and the total number of distinct values in the table. If you can think of a way to calculate it, you should propose a patch. IIRC, we try to make the histogram buckets evenly sized based on the number of occurrences. I've not followed the code in detail, but I'd guess that doing that allows us to just subtract off the MCV frequencies and assume the remainder is evenly split over each histogram bucket, so unless we had an n_distinct per histogram bucket, or at the very least n_distinct_for_histogram_values, then how would the calculation look for what we currently record? David
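For context, the non-MCV branch of var_eq_const() that this discussion keeps circling around boils down to something like the following simplified sketch (the real selfuncs.c code has extra clamps and guards); the histogram is never consulted there, which is the gap Andrei is pointing at:

static double
eq_sel_outside_mcv(double sum_mcv_freqs, double nullfrac,
				   double ndistinct, int num_mcv)
{
	/* fraction of rows not covered by NULLs or the MCV list ... */
	double		selec = 1.0 - sum_mcv_freqs - nullfrac;

	/* ... spread evenly over the distinct values not in the MCV list */
	if (ndistinct > num_mcv)
		selec /= ndistinct - num_mcv;

	return selec;
}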
Re: a wrong index choose when statistics is out of date
David Rowley writes: > On Wed, 6 Mar 2024 at 02:09, Andy Fan wrote: >> This patch introduces a new attoptions like this: >> >> ALTER TABLE t ALTER COLUMN col set (force_generic=true); >> >> Then selfunc.c realizes this and ignore the special Const value, then >> average selectivity is chosen. This fall into the weakness of generic >> plan, but this patch doesn't introduce any new weakness and we leave the >> decision to user which could resolve some problem. Also this logic only >> apply to eqsel since the ineqsel have the get_actual_variable_range >> mechanism which is helpful for index choose case at least. > > If you don't want the planner to use the statistics for the column why > not just do the following? Actually I didn't want the planner to ignore the statistics totally, I want the planner to treat the "Const" which is probably missing from the optimizer statistics as average, which is just like what we did for the generic plan for the query below. prepare s as SELECT * FROM t WHERE a = $1 and b = $2; explain (costs off) execute s(109, 8); QUERY PLAN - Index Scan using t_a_c_idx on t Index Cond: (a = 109) Filter: (b = 8) (3 rows) Custom plan: wrong index, due to a bad estimation for a = 109. set plan_cache_mode to force_generic_plan ; explain (costs off) execute s(109, 8); QUERY PLAN --- Index Scan using t_a_b_idx on t Index Cond: ((a = $1) AND (b = $2)) -- Correct index. (2 rows) Generic plan - we use the average estimation for the missed optimizer statistics part and *if the new value is not so different from existing ones*, we can get a desired result. It is true that the "generic" way is not as exactly accurate as the "custom" way since the latter one can use the data in MCV, but that is the cost we have to pay to make the missed optimizer statistics less important, and the generic plan has the same issue as well. As for this aspect, I think the way you proposed probably has a wider use case. -- Best Regards Andy Fan
Re: a wrong index choose when statistics is out of date
Andrei Lepikhov writes: > On 5/3/2024 19:56, Andy Fan wrote: >> I think it is OK for a design review, for the implementaion side, the >> known issue includes: >> 1. Support grap such infromation from its parent for partitioned table >> if the child doesn't have such information. >> 2. builtin document and testing. >> Any feedback is welcome. > Thanks for your efforts. > I was confused when you showed the problem connected to clauses like > "Var op Const" and "Var op Param". Hmm, then what is the solution in your mind when you said "tricky" in [1]? I am thinking we have some communication gap here. > As far as I know, the estimation logic of such clauses uses MCV and > number-distinct statistics. So, being out of MCV values, it becomes > totally insensitive to any internal skew in data and any data outside > the statistics boundaries. > Having studied the example you provided with the patch, I think it is > not a correct example: > Difference between var_eq_const and var_eq_non_const quite obvious: The response should be the same as what I did in [2]; let's see if we can make the gap between us smaller. > In the second routine, you don't have information about the const value > and can't use MCV for estimation. Also, you can't exclude MCV values > from the estimation. And it is just luck that you've got the right > answer. I think if you increased the weight of the unknown part, you > would get a bad result, too. > I would like to ask David why the var_eq_const estimator doesn't have an > option for estimation with a histogram. Having that would relieve a > problem with skewed data. Detecting the situation with incoming const > that is out of the covered area would allow us to fall back to ndistinct > estimation or something else. At least, histogram usage can be > restricted by the reltuples value and ratio between the total number of > MCV values and the total number of distinct values in the table. I think an example which shows your algorithm is better would be pretty helpful for communication. [1] https://www.postgresql.org/message-id/15381eea-cbc3-4087-9d90-ab752292bd54%40postgrespro.ru [2] https://www.postgresql.org/message-id/87msra9vgo.fsf%40163.com -- Best Regards Andy Fan
Re: [PoC] Improve dead tuple storage for lazy vacuum
On Thu, Mar 7, 2024 at 4:47 PM Masahiko Sawada wrote: > > On Thu, Mar 7, 2024 at 6:37 PM John Naylor wrote: > > $ git grep 'link_with: pgport_srv' > > src/test/modules/test_radixtree/meson.build: link_with: pgport_srv, > > > > No other test module uses this directive, and indeed, removing this > > still builds fine for me. Thoughts? > > Yeah, it could be the culprit. The test_radixtree/meson.build is the > sole extension that explicitly specifies a link with pgport_srv. I > think we can get rid of it as I've also confirmed the build still fine > even without it. olingo and grassquit have turned green, so that must have been it.
Re: Invalid query generated by postgres_fdw with UNION ALL and ORDER BY
On Thu, 7 Mar 2024 at 19:09, Michał Kłeczek wrote: > > The following query: > > SELECT * FROM ( > SELECT 2023 AS year, * FROM remote_table_1 > UNION ALL > SELECT 2022 AS year, * FROM remote_table_2 > ) > ORDER BY year DESC; > > yields the following remote query: > > SELECT [columns] FROM remote_table_1 ORDER BY 2023 DESC > > and subsequently fails remote execution. > > > Not really sure where the problem is - the planner or postgres_fdw. > I guess it is postgres_fdw not filtering out ordering keys. Interesting. I've attached a self-contained recreator for the casual passerby. I think the fix should go in appendOrderByClause(). It's at that point we look for the EquivalenceMember for the relation and can easily discover if the em_expr is a Const. I think we can safely just skip doing any ORDER BY stuff and not worry about if the literal format of the const will appear as a reference to an ordinal column position in the ORDER BY clause. Something like the attached patch I think should work. I wonder if we need a test... David diff --git a/contrib/postgres_fdw/deparse.c b/contrib/postgres_fdw/deparse.c index 8fc66fa11c..2f4ed33173 100644 --- a/contrib/postgres_fdw/deparse.c +++ b/contrib/postgres_fdw/deparse.c @@ -3914,11 +3914,11 @@ appendOrderByClause(List *pathkeys, bool has_final_sort, int nestlevel; const char *delim = " "; StringInfo buf = context->buf; + boolgotone = false; /* Make sure any constants in the exprs are printed portably */ nestlevel = set_transmission_modes(); - appendStringInfoString(buf, " ORDER BY"); foreach(lcell, pathkeys) { PathKey*pathkey = lfirst(lcell); @@ -3949,6 +3949,21 @@ appendOrderByClause(List *pathkeys, bool has_final_sort, if (em == NULL) elog(ERROR, "could not find pathkey item to sort"); + /* +* If the member just has a Const expression then we needn't add it to +* the ORDER BY clause. This can happen in UNION ALL queries where the +* union child targetlist has a Const. Adding these would be wasteful, +* but also, for INT columns, putting an integer literal will be seen +* as an ordinal column position rather than a value to sort by, so we +* must skip these. +*/ + if (IsA(em->em_expr, Const)) + continue; + + if (!gotone) + appendStringInfoString(buf, " ORDER BY"); + gotone = true; + em_expr = em->em_expr; /* demo_of_postgres_fdw_order_by_const_bug.sql Description: Binary data
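For readers without the attachment, a recreator along these lines (loopback server and table names invented here) shows the failing remote ORDER BY:

CREATE EXTENSION postgres_fdw;

CREATE SERVER loopback FOREIGN DATA WRAPPER postgres_fdw
    OPTIONS (dbname 'postgres');
CREATE USER MAPPING FOR CURRENT_USER SERVER loopback;

CREATE TABLE local_t1 (a int, b text);
CREATE TABLE local_t2 (a int, b text);

CREATE FOREIGN TABLE remote_table_1 (a int, b text)
    SERVER loopback OPTIONS (table_name 'local_t1');
CREATE FOREIGN TABLE remote_table_2 (a int, b text)
    SERVER loopback OPTIONS (table_name 'local_t2');

-- The per-branch constant ends up in the deparsed remote ORDER BY,
-- where the integer literal is read as an ordinal column position and
-- the remote query errors out.
SELECT * FROM (
    SELECT 2023 AS year, * FROM remote_table_1
    UNION ALL
    SELECT 2022 AS year, * FROM remote_table_2
) s
ORDER BY year DESC;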
Re: remaining sql/json patches
On 3/7/24 06:18, Himanshu Upadhyaya wrote: > On Wed, Mar 6, 2024 at 9:04 PM Tomas Vondra > wrote: > >> >> >> I'm pretty sure this is the correct & expected behavior. The second >> query treats the value as string (because that's what should happen for >> values in double quotes). >> >> ok, Then why does the below query provide the correct conversion, even if > we enclose that in double quotes? > ‘postgres[102531]=#’SELECT * FROM JSON_TABLE(jsonb '{ > "id" : "1234567890", > "FULL_NAME" : "JOHN DOE"}', > '$' > COLUMNS( > name varchar(20) PATH 'lax $.FULL_NAME', > id int PATH 'lax $.id' > ) >) > ; >name | id > --+ > JOHN DOE | 1234567890 > (1 row) > > and for bigger input(string) it will leave as empty as below. > ‘postgres[102531]=#’SELECT * FROM JSON_TABLE(jsonb '{ > "id" : "12345678901", > "FULL_NAME" : "JOHN DOE"}', > '$' > COLUMNS( > name varchar(20) PATH 'lax $.FULL_NAME', > id int PATH 'lax $.id' > ) >) > ; >name | id > --+ > JOHN DOE | > (1 row) > > seems it is not something to do with data enclosed in double quotes but > somehow related with internal casting it to integer and I think in case of > bigger input it is not able to cast it to integer(as defined under COLUMNS > as id int PATH 'lax $.id') > > ‘postgres[102531]=#’SELECT * FROM JSON_TABLE(jsonb '{ > "id" : "12345678901", > "FULL_NAME" : "JOHN DOE"}', > '$' > COLUMNS( > name varchar(20) PATH 'lax $.FULL_NAME', > id int PATH 'lax $.id' > ) >) > ; >name | id > --+ > JOHN DOE | > (1 row) > ) > > if it is not able to represent it to integer because of bigger input, it > should error out with a similar error message instead of leaving it empty. > > Thoughts? > Ah, I see! Yes, that's a bit weird. Put slightly differently: test=# SELECT * FROM JSON_TABLE(jsonb '{"id" : "2000000000"}', '$' COLUMNS(id int PATH '$.id')); id 2000000000 (1 row) Time: 0.248 ms test=# SELECT * FROM JSON_TABLE(jsonb '{"id" : "3000000000"}', '$' COLUMNS(id int PATH '$.id')); id (1 row) Clearly, when converting the string literal into int value, there's some sort of error handling that realizes 3B overflows, and returns NULL instead. I'm not sure if this is intentional. regards -- Tomas Vondra EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: A problem about partitionwise join
On Thu, Feb 22, 2024 at 2:56 PM Ashutosh Bapat wrote: > On Wed, Feb 21, 2024 at 4:55 PM Richard Guo > wrote: > > > > > > On Tue, Aug 2, 2022 at 4:24 AM Jacob Champion > wrote: > >> > >> Once you think you've built up some community support and the patchset > >> is ready for review, you (or any interested party) can resurrect the > >> patch entry by visiting > >> > >> https://commitfest.postgresql.org/38/2266/ > >> > >> and changing the status to "Needs Review", and then changing the > >> status again to "Move to next CF". (Don't forget the second step; > >> hopefully we will have streamlined this in the near future!) > > > > > > This patch was returned due to 'lack of interest'. However, upon > > verification, it appears that the reported issue still exists, and the > > proposed fix in the thread remains valid. Hence, resurrect this patch > > after rebasing it on master. I've also written a detailed commit > > message which hopefully can help people review the changes more > > effectively. > > I did a deeper review of the patch. Here are some comments Approach The equijoin condition between partition keys doesn't appear in the join's restrictilist because of 'best_score' strategy as you explained well in [2]. What if we add an extra score for clauses between partition keys and give preference to equijoin between partition keys? Have you given it a thought? I feel that having an equijoin clause involving partition keys has more usages compared to a clause with any random column. E.g. nextloop may be able to prune partitions from inner relation if the clause contains a partition key. Partition pruning requires equality clauses on partition keys as well. create_append_plan() fetches those from best_path->param_info. If we created and saved the clauses involving partition keys somewhere separately, similar to the clauses involving index keys, it might help this case as well as the partition pruning code. Have you considered this idea? There was a proposal to use ECs for outer joins as well and then use only ECs to decide whether equijoins between partition keys exist. I don't think the proposal has materialized. So we have to continue looking at restrictlist as well. I don't see a point waiting for it, but others might feel differently. I am just trying to find ways to avoid two loops in have_partkey_equi_join(). If the alternatives are worse, I think the current approach is fine. Documentation - The patch does not modify any documentation. The only documentation I could find about partitionwise join is the one for GUC 'enable_partitionwise_join'. It says --- quote "Partitionwise join currently applies only when the join conditions include all the partition keys, which must be of the same data type and have one-to-one matching sets of child partitions.". --- unquote This sentence is general and IMO covers the case this patch considers. But in general I feel that partitionwise join and aggregation deserve separate sections next to "partition pruning" in [1]; It should mention advanced partition matching algorithm as well. Would you be willing to write one and then expand it for the case in the patch? Tests - The patch adds a testcase for single column partitioning. I think we need to do better like 1. Test for partitioning on expression, multilevel partitioning, advanced partition matching. Those all might just work. Having tests helps us to notice any future breakage. 2. Some negative test case e.g. 
equijoin clauses with disjunction, with inequality operator, equality operators with operators from different families etc. 3. The testcase added looks artificial. it outputs data that has same value for two columns which is equal to the primary key of the other table - when would somebody do that?. Is there some real life example where this change will be useful? Code Minor comment for now. It will be better to increment num_equal_pks immediately after setting pk_known_equal[ipk] = true. Otherwise the code gets confusing around line 2269. I will spend more time reviewing the code next week. [1] https://www.postgresql.org/docs/current/ddl-partitioning.html [2] https://www.postgresql.org/message-id/can_9jtxucgdvy9tv6uxq0cdhrw98bztxpkfbf_75qdpi5wb...@mail.gmail.com -- Best Wishes, Ashutosh Bapat
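As a concrete illustration of the kind of negative test suggested in point 2, a rough sketch (table names invented) where a partitionwise join must not be chosen because the clause is an inequality between the partition keys:

SET enable_partitionwise_join = on;

CREATE TABLE prt1 (a int, b int) PARTITION BY RANGE (a);
CREATE TABLE prt1_p1 PARTITION OF prt1 FOR VALUES FROM (0) TO (100);
CREATE TABLE prt1_p2 PARTITION OF prt1 FOR VALUES FROM (100) TO (200);

CREATE TABLE prt2 (a int, b int) PARTITION BY RANGE (a);
CREATE TABLE prt2_p1 PARTITION OF prt2 FOR VALUES FROM (0) TO (100);
CREATE TABLE prt2_p2 PARTITION OF prt2 FOR VALUES FROM (100) TO (200);

-- Expect a single join above two Appends, not per-partition joins.
EXPLAIN (COSTS OFF)
SELECT * FROM prt1 t1 JOIN prt2 t2 ON t1.a < t2.a;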
Re: Properly pathify the union planner
On Thu, 15 Feb 2024 at 17:30, David Rowley wrote: > > On Tue, 6 Feb 2024 at 22:05, Richard Guo wrote: > > I'm thinking that maybe it'd be better to move the work of sorting the > > subquery's paths to the outer query level, specifically within the > > build_setop_child_paths() function, just before we stick SubqueryScanPath > > on top of the subquery's paths. I think this is better because: > > > > 1. This minimizes the impact on subquery planning and reduces the > > footprint within the grouping_planner() function as much as possible. > > > > 2. This can help avoid the aforementioned add_path() issue because the > > two involved paths will be structured as: > > Yes, this is a good idea. I agree with both of your points. > v2 attached. If anyone else or if you want to take another look, let me know soon. Otherwise, I'll assume that's the reviews over and I can take another look again. If nobody speaks up before Monday next week (11th), New Zealand time, I'm going to be looking at this again from the point of view of committing it. Thanks David
Re: Change prefetch and read strategies to use range in pg_prewarm ... and raise a question about posix_fadvise WILLNEED
Hi, On Wed, 6 Mar 2024 at 18:23, Cédric Villemain wrote: > > Hi Nazir, > > > thank you for your review. I comment below. > > > On 05/03/2024 12:07, Nazir Bilal Yavuz wrote: > >> 2. The second one does implement smgrprefetch with range and loops by > >> default per segment to still have a check for interrupts. > > It looks good codewise but RELSEG_SIZE is too big to prefetch. Man > > page of posix_fadvise [1] states that: "The amount of data read may be > > decreased by the kernel depending on virtual memory load. (A few > > megabytes will usually be fully satisfied, and more is rarely > > useful.)". It is trying to prefetch 1GB data now. That could explain > > your observation about differences between nr_cache numbers. > > From an "adminsys" point of view I will find beneficial to get a single > syscall per file, respecting the logic and behavior of underlying system > call. I agree. > The behavior is 100% OK, and in fact it might a bad idea to prefetch > block by block as the result is just to put more pressure on a system if > it is already under pressure. > > Though there are use cases and it's nice to be able to do that too at > this per page level. Yes, I do not know which one is more important, cache more blocks but create more pressure or create less pressure but cache less blocks. Also, pg_prewarm is designed to be run at startup so I guess there will not be much pressure. > About [1], it's very old statement about resources. And Linux manages a > part of the problem for us here I think [2]: > > /* > * Chunk the readahead into 2 megabyte units, so that we don't pin too much > * memory at once. > */ > void force_page_cache_ra() Thanks for pointing out the actual code. Yes, it looks like the kernel is already doing that. I would like to do more testing when you forward vm_relation functions into pgfincore. > >> Q: posix_fadvise may not work exactly the way you think it does, or does > >> it ? > >> > >> > >> In details, and for the question: > >> > >> However, if instead you provide a real range, or the magic len=0 to > >> posix_fadvise, then blocks are "more" loaded according to effective vm > >> pressure (which is not the case on the previous example). > >> As a result only a small part of the relation might be loaded, and this > >> is probably not what end-users expect despite being probably a good > >> choice (you can still free cache beforehand to help the kernel). > > I think it's a matter of documenting well the feature, and if at all > possible, as usual, not let users be negatively impacted by default. 
> > > >> An example, below I'm using vm_relation_cachestat() which provides linux > >> cachestat output, and vm_relation_fadvise() to unload cache, and > >> pg_prewarm for the demo: > >> > >> # clear cache: (nr_cache is the number of file system pages in cache, > >> not postgres blocks) > >> > >> ``` > >> postgres=# select block_start, block_count, nr_pages, nr_cache from > >> vm_relation_cachestat('foo',range:=1024*32); > >> block_start | block_count | nr_pages | nr_cache > >> -+-+--+-- > >> 0 | 32768 |65536 |0 > >> 32768 | 32768 |65536 |0 > >> 65536 | 32768 |65536 |0 > >> 98304 | 32768 |65536 |0 > >>131072 |1672 | 3344 |0 > >> ``` > >> > >> # load full relation with pg_prewarm (patched) > >> > >> ``` > >> postgres=# select pg_prewarm('foo','prefetch'); > >> pg_prewarm > >> > >> 132744 > >> (1 row) > >> ``` > >> > >> # Checking results: > >> > >> ``` > >> postgres=# select block_start, block_count, nr_pages, nr_cache from > >> vm_relation_cachestat('foo',range:=1024*32); > >> block_start | block_count | nr_pages | nr_cache > >> -+-+--+-- > >> 0 | 32768 |65536 | 320 > >> 32768 | 32768 |65536 |0 > >> 65536 | 32768 |65536 |0 > >> 98304 | 32768 |65536 |0 > >>131072 |1672 | 3344 | 320 <-- segment 1 > >> > >> ``` > >> > >> # Load block by block and check: > >> > >> ``` > >> postgres=# select from generate_series(0, 132743) g(n), lateral > >> pg_prewarm('foo','prefetch', 'main', n, n); > >> postgres=# select block_start, block_count, nr_pages, nr_cache from > >> vm_relation_cachestat('foo',range:=1024*32); > >> block_start | block_count | nr_pages | nr_cache > >> -+-+--+-- > >> 0 | 32768 |65536 |65536 > >> 32768 | 32768 |65536 |65536 > >> 65536 | 32768 |65536 |65536 > >> 98304 | 32768 |65536 |65536 > >>131072 |1672 | 3344 | 3344 > >> > >> ``` > >> > >> The duration of the last example is also really significant: full > >> relation is 0.3ms and block by block is 1550ms
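To make "a single syscall per file, looping per segment" concrete, a rough standalone sketch of the two shapes being compared (plain C, not the pg_prewarm patch itself):

#define _XOPEN_SOURCE 600
#include <fcntl.h>

/* one advisory call for a whole range; the kernel chunks the readahead */
static int
prefetch_range(int fd, off_t start, off_t len)
{
	return posix_fadvise(fd, start, len, POSIX_FADV_WILLNEED);
}

/* the slow variant measured above: one syscall per 8kB block */
static int
prefetch_block_by_block(int fd, off_t start, off_t len, off_t blocksz)
{
	for (off_t off = start; off < start + len; off += blocksz)
	{
		int		rc = posix_fadvise(fd, off, blocksz, POSIX_FADV_WILLNEED);

		if (rc != 0)
			return rc;
	}
	return 0;
}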
Re: a wrong index choose when statistics is out of date
On Thu, 7 Mar 2024 at 23:40, Andy Fan wrote: > > David Rowley writes: > > If you don't want the planner to use the statistics for the column why > > not just do the following? > > Actually I didn't want the planner to ignore the statistics totally, I > want the planner to treat the "Const" which is probably missing from the > optimizer statistics as average, which is just like what we did for the > generic plan for the query below. I'm with Andrei on this one and agree with his "And it is just luck that you've got the right answer". I think we should fix the general problem of the planner not choosing the better index. I understand you've had a go at that before, but I didn't think fudging the costs was the right fix to coax the planner into the safer choice. I'm not personally interested in any bandaid fixes for this. I'd rather see us come up with a long-term solution that just makes things better. I also understand you're probably frustrated and just want to make something better. However, it's not like it's a new problem. The more general problem of the planner making risky choices predates both of our time spent working on PostgreSQL, so I don't think a hasty solution that fixes some small subset of the problem is that helpful. David
Re: Properly pathify the union planner
On Thu, Mar 7, 2024 at 7:16 PM David Rowley wrote: > On Thu, 15 Feb 2024 at 17:30, David Rowley wrote: > > > > On Tue, 6 Feb 2024 at 22:05, Richard Guo wrote: > > > I'm thinking that maybe it'd be better to move the work of sorting the > > > subquery's paths to the outer query level, specifically within the > > > build_setop_child_paths() function, just before we stick > SubqueryScanPath > > > on top of the subquery's paths. I think this is better because: > > > > > > 1. This minimizes the impact on subquery planning and reduces the > > > footprint within the grouping_planner() function as much as possible. > > > > > > 2. This can help avoid the aforementioned add_path() issue because the > > > two involved paths will be structured as: > > > > Yes, this is a good idea. I agree with both of your points. > > > v2 attached. > > If anyone else or if you want to take another look, let me know soon. > Otherwise, I'll assume that's the reviews over and I can take another > look again. Hi David, I would like to have another look, but it might take several days. Would that be too late? Thanks Richard
Re: [HACKERS] make async slave to wait for lsn to be replayed
Intro == The main purpose of the feature is to achieve read-your-writes-consistency, while using async replica for reads and primary for writes. In that case lsn of last modification is stored inside application. We cannot store this lsn inside database, since reads are distributed across all replicas and primary. Two implementations of one feature == We left two different solutions. Help me please to choose the best. 1) Classic (wait_classic_v7.patch) https://www.postgresql.org/message-id/3cc883048264c2e9af022033925ff8db%40postgrespro.ru Synopsis == advantages: multiple wait events, separate WAIT FOR statement disadvantages: new words in grammar WAIT FOR [ANY | ALL] event [, ...] BEGIN [ WORK | TRANSACTION ] [ transaction_mode [, ...] ] [ WAIT FOR [ANY | ALL] event [, ...]] event: LSN value TIMEOUT number_of_milliseconds timestamp 2) After style: Kyotaro and Freund (wait_after_within_v6.patch) https://www.postgresql.org/message-id/d3ff2e363af60b345f82396992595a03%40postgrespro.ru Synopsis == advantages: no new words in grammar disadvantages: a little harder to understand, fewer events to wait AFTER lsn_event [ WITHIN delay_milliseconds ] [, ...] BEGIN [ WORK | TRANSACTION ] [ transaction_mode [, ...] ] [ AFTER lsn_event [ WITHIN delay_milliseconds ]] START [ WORK | TRANSACTION ] [ transaction_mode [, ...] ] [ AFTER lsn_event [ WITHIN delay_milliseconds ]] Examples == primary standby --- postgresql.conf recovery_min_apply_delay = 10s CREATE TABLE tbl AS SELECT generate_series(1,10) AS a; INSERT INTO tbl VALUES (generate_series(11, 20)); SELECT pg_current_wal_lsn(); BEGIN WAIT FOR LSN '0/3002AE8'; SELECT * FROM tbl; // read fresh insertions COMMIT; Rebased and ready for review. -- Ivan Kartyshov Postgres Professional: www.postgrespro.comdiff --git a/doc/src/sgml/ref/allfiles.sgml b/doc/src/sgml/ref/allfiles.sgml index 4a42999b18..657a217e27 100644 --- a/doc/src/sgml/ref/allfiles.sgml +++ b/doc/src/sgml/ref/allfiles.sgml @@ -188,6 +188,7 @@ Complete list of usable sgml source files in this directory. + diff --git a/doc/src/sgml/ref/begin.sgml b/doc/src/sgml/ref/begin.sgml index 016b021487..b3af16c09f 100644 --- a/doc/src/sgml/ref/begin.sgml +++ b/doc/src/sgml/ref/begin.sgml @@ -21,13 +21,21 @@ PostgreSQL documentation -BEGIN [ WORK | TRANSACTION ] [ transaction_mode [, ...] ] +BEGIN [ WORK | TRANSACTION ] [ transaction_mode [, ...] ] wait_for_event where transaction_mode is one of: ISOLATION LEVEL { SERIALIZABLE | REPEATABLE READ | READ COMMITTED | READ UNCOMMITTED } READ WRITE | READ ONLY [ NOT ] DEFERRABLE + +where wait_for_event is: +WAIT FOR [ANY | ALL] event [, ...] + +and event is one of: +LSN lsn_value +TIMEOUT number_of_milliseconds +timestamp diff --git a/doc/src/sgml/ref/start_transaction.sgml b/doc/src/sgml/ref/start_transaction.sgml index 74ccd7e345..1b54ed2084 100644 --- a/doc/src/sgml/ref/start_transaction.sgml +++ b/doc/src/sgml/ref/start_transaction.sgml @@ -21,13 +21,21 @@ PostgreSQL documentation -START TRANSACTION [ transaction_mode [, ...] ] +START TRANSACTION [ transaction_mode [, ...] ] wait_for_event where transaction_mode is one of: ISOLATION LEVEL { SERIALIZABLE | REPEATABLE READ | READ COMMITTED | READ UNCOMMITTED } READ WRITE | READ ONLY [ NOT ] DEFERRABLE + +where wait_for_event is: +WAIT FOR [ANY | ALL] event [, ...] 
+ +and event is one of: +LSN lsn_value +TIMEOUT number_of_milliseconds +timestamp diff --git a/doc/src/sgml/ref/wait.sgml b/doc/src/sgml/ref/wait.sgml new file mode 100644 index 00..26cae3ad85 --- /dev/null +++ b/doc/src/sgml/ref/wait.sgml @@ -0,0 +1,146 @@ + + + + + WAIT FOR + + + + WAIT FOR + 7 + SQL - Language Statements + + + + WAIT FOR + wait for the target LSN to be replayed or for specified time to pass + + + + +WAIT FOR [ANY | ALL] event [, ...] + +where event is one of: +LSN value +TIMEOUT number_of_milliseconds +timestamp + +WAIT FOR LSN 'lsn_number' +WAIT FOR LSN 'lsn_number' TIMEOUT wait_timeout +WAIT FOR LSN 'lsn_number', TIMESTAMP wait_time +WAIT FOR TIMESTAMP wait_time +WAIT FOR ALL LSN 'lsn_number' TIMEOUT wait_timeout, TIMESTAMP wait_time +WAIT FOR ANY LSN 'lsn_number', TIMESTAMP wait_time + + + + + Description + + + WAIT FOR provides a simple interprocess communication + mechanism to wait for timestamp, timeout or the target log sequence number + (LSN) on standby in PostgreSQL + databases with master-standby asynchronous replication. When run with the + LSN option, the WAIT FOR + command waits for the specified LSN to be replayed. + If no timestamp or timeout was specified, wait time is unlimited. + Waiting can be interrupted using Ctrl+C, or + by shutting down the postgres server. + + + + + + Parameters + + + +
Re: Shared detoast Datum proposal
On 3/7/24 08:33, Nikita Malakhov wrote: > Hi! > > Tomas, I thought about this issue - >> What if you only need the first slice? In that case decoding everything >> will be a clear regression. > And completely agree with you, I'm currently working on it and will post > a patch a bit later. Cool. I don't plan to work on improving my patch - it was merely a PoC, so if you're interested in working on that, that's good. > Another issue we have here - if we have multiple slices requested - > then we have to update cached slice from both sides, the beginning > and the end. > No opinion. I haven't thought about how to handle slices too much. > On update, yep, you're right >> Surely the update creates a new TOAST value, with a completely new TOAST >> pointer, new rows in the TOAST table etc. It's not updated in-place. So >> it'd end up with two independent entries in the TOAST cache. > >> Or are you interested just in evicting the old value simply to free the >> memory, because we're not going to need that (now invisible) value? That >> seems interesting, but if it's too invasive or complex I'd just leave it >> up to the LRU eviction (at least for v1). > Again, yes, we do not need the old value after it was updated and > it is better to take care of it explicitly. It's a simple and not invasive > addition to your patch. > OK > Could not agree with you about large values - it makes sense > to cache large values because the larger it is the more benefit > we will have from not detoasting it again and again. One way > is to keep them compressed, but what if we have data with very > low compression rate? > I'm not against caching large values, but I also think it's not difficult to construct cases where it's a losing strategy. > I also changed HASHACTION field value from HASH_ENTER to > HASH_ENTER_NULL to softly deal with case when we do not > have enough memory and added checks for if the value was stored > or not. > I'm not sure about this. HASH_ENTER_NULL is only being used in three very specific places (two are lock management, one is shmeminitstruct). This hash table is not unique in any way, so why not to use HASH_ENTER like 99% other hash tables? regards -- Tomas Vondra EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
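For context on that last point, the only difference between the two is how dynahash reports running out of memory; a simplified sketch of the two calling patterns follows (the entry and key names are invented). Note also that, at least as I read dynahash.c, HASH_ENTER_NULL is asserted to be usable only with non-default (shared-memory) allocators, which may be part of why it shows up in so few places:

	ToastCacheEntry *entry;
	bool		found;

	/* usual pattern: out of memory simply raises an ERROR */
	entry = (ToastCacheEntry *) hash_search(toast_cache, &key,
											HASH_ENTER, &found);

	/* HASH_ENTER_NULL pattern: out of memory returns NULL instead */
	entry = (ToastCacheEntry *) hash_search(toast_cache, &key,
											HASH_ENTER_NULL, &found);
	if (entry == NULL)
		return;				/* just skip caching this value */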
Re: Invalid query generated by postgres_fdw with UNION ALL and ORDER BY
On Thu, Mar 7, 2024 at 4:39 PM David Rowley wrote: > On Thu, 7 Mar 2024 at 19:09, Michał Kłeczek wrote: > > > > The following query: > > > > SELECT * FROM ( > > SELECT 2023 AS year, * FROM remote_table_1 > > UNION ALL > > SELECT 2022 AS year, * FROM remote_table_2 > > ) > > ORDER BY year DESC; > > > > yields the following remote query: > > > > SELECT [columns] FROM remote_table_1 ORDER BY 2023 DESC > > > > and subsequently fails remote execution. > > > > > > Not really sure where the problem is - the planner or postgres_fdw. > > I guess it is postgres_fdw not filtering out ordering keys. > > Interesting. I've attached a self-contained recreator for the casual > passerby. > > I think the fix should go in appendOrderByClause(). It's at that > point we look for the EquivalenceMember for the relation and can > easily discover if the em_expr is a Const. I think we can safely just > skip doing any ORDER BY stuff and not worry about if the > literal format of the const will appear as a reference to an ordinal > column position in the ORDER BY clause. > deparseSortGroupClause() calls deparseConst() with showtype = 1. appendOrderByClause() may want to do something similar for consistency. Or remove it from deparseSortGroupClause() as well? > > Something like the attached patch I think should work. > > I wonder if we need a test... > Yes. -- Best Wishes, Ashutosh Bapat
Re: remaining sql/json patches
On Thu, Mar 7, 2024 at 8:13 PM Tomas Vondra wrote: > On 3/7/24 06:18, Himanshu Upadhyaya wrote: Thanks Himanshu for the testing. > > On Wed, Mar 6, 2024 at 9:04 PM Tomas Vondra > > wrote: > >> > >> I'm pretty sure this is the correct & expected behavior. The second > >> query treats the value as string (because that's what should happen for > >> values in double quotes). > >> > >> ok, Then why does the below query provide the correct conversion, even if > > we enclose that in double quotes? > > ‘postgres[102531]=#’SELECT * FROM JSON_TABLE(jsonb '{ > > "id" : "1234567890", > > "FULL_NAME" : "JOHN DOE"}', > > '$' > > COLUMNS( > > name varchar(20) PATH 'lax $.FULL_NAME', > > id int PATH 'lax $.id' > > ) > >) > > ; > >name | id > > --+ > > JOHN DOE | 1234567890 > > (1 row) > > > > and for bigger input(string) it will leave as empty as below. > > ‘postgres[102531]=#’SELECT * FROM JSON_TABLE(jsonb '{ > > "id" : "12345678901", > > "FULL_NAME" : "JOHN DOE"}', > > '$' > > COLUMNS( > > name varchar(20) PATH 'lax $.FULL_NAME', > > id int PATH 'lax $.id' > > ) > >) > > ; > >name | id > > --+ > > JOHN DOE | > > (1 row) > > > > seems it is not something to do with data enclosed in double quotes but > > somehow related with internal casting it to integer and I think in case of > > bigger input it is not able to cast it to integer(as defined under COLUMNS > > as id int PATH 'lax $.id') > > > > ‘postgres[102531]=#’SELECT * FROM JSON_TABLE(jsonb '{ > > "id" : "12345678901", > > "FULL_NAME" : "JOHN DOE"}', > > '$' > > COLUMNS( > > name varchar(20) PATH 'lax $.FULL_NAME', > > id int PATH 'lax $.id' > > ) > >) > > ; > >name | id > > --+ > > JOHN DOE | > > (1 row) > > ) > > > > if it is not able to represent it to integer because of bigger input, it > > should error out with a similar error message instead of leaving it empty. > > > > Thoughts? > > > > Ah, I see! Yes, that's a bit weird. Put slightly differently: > > test=# SELECT * FROM JSON_TABLE(jsonb '{"id" : "20"}', > '$' COLUMNS(id int PATH '$.id')); > id > > 20 > (1 row) > > Time: 0.248 ms > test=# SELECT * FROM JSON_TABLE(jsonb '{"id" : "30"}', > '$' COLUMNS(id int PATH '$.id')); > id > > > (1 row) > > Clearly, when converting the string literal into int value, there's some > sort of error handling that realizes 3B overflows, and returns NULL > instead. I'm not sure if this is intentional. Indeed. This boils down to the difference in the cast expression chosen to convert the source value to int in the two cases. The case where the source value has no quotes, the chosen cast expression is a FuncExpr for function numeric_int4(), which has no way to suppress errors. When the source value has quotes, the cast expression is a CoerceViaIO expression, which can suppress the error. The default behavior is to suppress the error and return NULL, so the correct behavior is when the source value has quotes. I think we'll need either: * fix the code in 0001 to avoid getting numeric_int4() in this case, and generally cast functions that don't have soft-error handling support, in favor of using IO coercion. * fix FuncExpr (like CoerceViaIO) to respect SQL/JSON's request to suppress errors and fix downstream functions like numeric_int4() to comply by handling errors softly. I'm inclined to go with the 1st option as we already have the infrastructure in place -- input functions can all handle errors softly. For the latter, it uses numeric_int4() which doesn't support soft-error handling, so throws the error. 
-- Thanks, Amit Langote
Re: [HACKERS] make async slave to wait for lsn to be replayed
On Thu, Mar 7, 2024 at 5:14 PM Kartyshov Ivan wrote: > > Intro > == > The main purpose of the feature is to achieve > read-your-writes-consistency, while using async replica for reads and > primary for writes. In that case lsn of last modification is stored > inside application. We cannot store this lsn inside database, since > reads are distributed across all replicas and primary. > > > Two implementations of one feature > == > We left two different solutions. Help me please to choose the best. > > > 1) Classic (wait_classic_v7.patch) > https://www.postgresql.org/message-id/3cc883048264c2e9af022033925ff8db%40postgrespro.ru > Synopsis > == > advantages: multiple wait events, separate WAIT FOR statement > disadvantages: new words in grammar > > > > WAIT FOR [ANY | ALL] event [, ...] > BEGIN [ WORK | TRANSACTION ] [ transaction_mode [, ...] ] > [ WAIT FOR [ANY | ALL] event [, ...]] > event: > LSN value > TIMEOUT number_of_milliseconds > timestamp > > > > 2) After style: Kyotaro and Freund (wait_after_within_v6.patch) > https://www.postgresql.org/message-id/d3ff2e363af60b345f82396992595a03%40postgrespro.ru > Synopsis > == > advantages: no new words in grammar > disadvantages: a little harder to understand, fewer events to wait > > > > AFTER lsn_event [ WITHIN delay_milliseconds ] [, ...] > BEGIN [ WORK | TRANSACTION ] [ transaction_mode [, ...] ] > [ AFTER lsn_event [ WITHIN delay_milliseconds ]] > START [ WORK | TRANSACTION ] [ transaction_mode [, ...] ] > [ AFTER lsn_event [ WITHIN delay_milliseconds ]] > +1 for the second one not only because it avoids new words in grammar but also sounds to convey the meaning. I think you can explain in docs how this feature can be used basically how will one get the correct LSN value to specify. As suggested previously also pick one of the approaches (I would advocate the second one) and keep an option for the second one by mentioning it in the commit message. I hope to see more reviews/discussions or usage like how will users get the LSN value to be specified on the core logic of the feature at this stage. IF possible, state, how real-world applications could leverage this feature. -- With Regards, Amit Kapila.
Re: Change prefetch and read strategies to use range in pg_prewarm ... and raise a question about posix_fadvise WILLNEED
Hi Nazir, On 07/03/2024 12:19, Nazir Bilal Yavuz wrote: On Wed, 6 Mar 2024 at 18:23, Cédric Villemain wrote: The behavior is 100% OK, and in fact it might a bad idea to prefetch block by block as the result is just to put more pressure on a system if it is already under pressure. Though there are use cases and it's nice to be able to do that too at this per page level. Yes, I do not know which one is more important, cache more blocks but create more pressure or create less pressure but cache less blocks. Also, pg_prewarm is designed to be run at startup so I guess there will not be much pressure. autowarm is designed for that purpose but pg_prewarm is free to use when neeed. About [1], it's very old statement about resources. And Linux manages a part of the problem for us here I think [2]: /* * Chunk the readahead into 2 megabyte units, so that we don't pin too much * memory at once. */ void force_page_cache_ra() Thanks for pointing out the actual code. Yes, it looks like the kernel is already doing that. I would like to do more testing when you forward vm_relation functions into pgfincore. I hope to be able to get back there next week max. An example, below I'm using vm_relation_cachestat() which provides linux cachestat output, and vm_relation_fadvise() to unload cache, and pg_prewarm for the demo: # clear cache: (nr_cache is the number of file system pages in cache, not postgres blocks) ``` postgres=# select block_start, block_count, nr_pages, nr_cache from vm_relation_cachestat('foo',range:=1024*32); block_start | block_count | nr_pages | nr_cache -+-+--+-- 0 | 32768 |65536 |0 32768 | 32768 |65536 |0 65536 | 32768 |65536 |0 98304 | 32768 |65536 |0 131072 |1672 | 3344 |0 ``` # load full relation with pg_prewarm (patched) ``` postgres=# select pg_prewarm('foo','prefetch'); pg_prewarm 132744 (1 row) ``` # Checking results: ``` postgres=# select block_start, block_count, nr_pages, nr_cache from vm_relation_cachestat('foo',range:=1024*32); block_start | block_count | nr_pages | nr_cache -+-+--+-- 0 | 32768 |65536 | 320 32768 | 32768 |65536 |0 65536 | 32768 |65536 |0 98304 | 32768 |65536 |0 131072 |1672 | 3344 | 320 <-- segment 1 ``` # Load block by block and check: ``` postgres=# select from generate_series(0, 132743) g(n), lateral pg_prewarm('foo','prefetch', 'main', n, n); postgres=# select block_start, block_count, nr_pages, nr_cache from vm_relation_cachestat('foo',range:=1024*32); block_start | block_count | nr_pages | nr_cache -+-+--+-- 0 | 32768 |65536 |65536 32768 | 32768 |65536 |65536 65536 | 32768 |65536 |65536 98304 | 32768 |65536 |65536 131072 |1672 | 3344 | 3344 ``` The duration of the last example is also really significant: full relation is 0.3ms and block by block is 1550ms! You might think it's because of generate_series or whatever, but I have the exact same behavior with pgfincore. I can compare loading and unloading duration for similar "async" work, here each call is from block 0 with len of 132744 and a range of 1 block (i.e. posix_fadvise on 8kB at a time). So they have exactly the same number of operations doing DONTNEED or WILLNEED, but distinct duration on the first "load": ``` postgres=# select * from vm_relation_fadvise('foo','main',0,132744,1,'POSIX_FADV_DONTNEED'); vm_relation_fadvise - (1 row) Time: 25.202 ms postgres=# select * from vm_relation_fadvise('foo','main',0,132744,1,'POSIX_FADV_WILLNEED'); vm_relation_fadvise - (1 row) Time: 1523.636 ms (00:01.524) <- not free ! 
postgres=# select * from vm_relation_fadvise('foo','main',0,132744,1,'POSIX_FADV_WILLNEED'); vm_relation_fadvise - (1 row) Time: 24.967 ms ``` I confirm that there is a time difference between calling pg_prewarm by full relation and block by block, but IMO this is expected. When pg_prewarm is called by full relation, it does the initialization part just once but when it is called block by block, it does initialization for each call, right? Not sure what initialization is here exactly, in my example with WILLNEED/DONTNEED there are exactly the same code pattern and syscall request(s), just the flag is distinct, so initialization cost are expected to be very similar. Sorry, there was a miscommunication. I was talking about pg_prewarm's initialization, meaning if the pg_prewarm is called block by block (by using generate_series); it will make block_count times initial
Re: Dump-restore loosing 'attnotnull' bit for DEFERRABLE PRIMARY KEY column(s).
On 2024-Mar-05, Dean Rasheed wrote: > So I think RelationGetIndexAttrBitmap() should include deferrable PKs, I tried this, but it doesn't actually lead to a good place, because if we allow deferrable PKs to identify rows, then they are not useful to find the tuple to update when replicating. Consider the following case: $node_publisher->safe_psql('postgres', 'create table deferred_pk (id int primary key initially deferred, hidden int, value text)'); $node_subscriber->safe_psql('postgres', 'create table deferred_pk (id int primary key initially deferred, hidden int, value text)'); $node_subscriber->safe_psql('postgres', 'alter subscription tap_sub refresh publication'); $node_publisher->safe_psql('postgres', "insert into deferred_pk (id, hidden, value) values (1, 1, 'first')"); $node_publisher->wait_for_catchup('tap_sub'); $node_publisher->safe_psql('postgres', qq{ begin; insert into deferred_pk values (1, 2, 'conflicting'); update deferred_pk set value = value || ', updated' where id = 1 and hidden = 2; update deferred_pk set id = 3, value = value || ', updated' where hidden = 2; commit}); $node_publisher->wait_for_catchup('tap_sub'); my $pubdata = $node_publisher->safe_psql('postgres', 'select * from deferred_pk order by id'); my $subsdata = $node_subscriber->safe_psql('postgres', 'select * from deferred_pk order by id'); is($subsdata, $pubdata, "data is equal"); Here, the publisher's transaction first creates a new record with the same PK, which only works because the PK is deferred; then we update its payload column. When this is replicated, the row is identified by the PK ... but replication actually updates the other row, because it's found first: # Failed test 'data is equal' # at t/003_constraints.pl line 163. # got: '1|2|conflicting # 3|2|conflicting, updated, updated' # expected: '1|1|first # 3|2|conflicting, updated, updated' Actually, is that what happened here? I'm not sure, but clearly this is bogus. So I think the original developers of REPLICA IDENTITY had the right idea here (commit 07cacba983ef), and we mustn't change this aspect, because it'll lead to data corruption in replication. Using a deferred PK for DDL considerations seems OK, but it seems certain that for actual data replication it's going to be a disaster. -- Álvaro Herrera PostgreSQL Developer — https://www.EnterpriseDB.com/
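Stripped of the TAP machinery, the underlying hazard is simply that a deferred PK allows transient duplicates within a transaction, so at decode time the key may not identify a unique row. A minimal sketch:
```
CREATE TABLE deferred_pk (
    id    int PRIMARY KEY INITIALLY DEFERRED,
    value text
);

INSERT INTO deferred_pk VALUES (1, 'first');

BEGIN;
-- Allowed: uniqueness is only checked at commit time.
INSERT INTO deferred_pk VALUES (1, 'conflicting');
-- At this point two rows share id = 1, so "WHERE id = 1" is ambiguous,
-- which is exactly what trips up the apply worker on the subscriber.
UPDATE deferred_pk SET id = 3 WHERE value = 'conflicting';
COMMIT;  -- the constraint is verified here; no violation remains
```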
Re: Dump-restore loosing 'attnotnull' bit for DEFERRABLE PRIMARY KEY column(s).
On Thu, 7 Mar 2024 at 13:00, Alvaro Herrera wrote: > > So I think the original developers of REPLICA IDENTITY had the right > idea here (commit 07cacba983ef), and we mustn't change this aspect, > because it'll lead to data corruption in replication. Using a deferred > PK for DDL considerations seems OK, but it seems certain that for actual > data replication it's going to be a disaster. > Yes, that makes sense. If I understand correctly though, the replication code uses relation->rd_replidindex (not relation->rd_pkindex), although sometimes it's the same thing. So can we get away with making sure that RelationGetIndexList() doesn't set relation->rd_replidindex to a deferrable PK, while still allowing relation->rd_pkindex to be one? Regards, Dean
Re: remaining sql/json patches
On Thu, Mar 7, 2024 at 8:06 PM Amit Langote wrote: > > > Indeed. > > This boils down to the difference in the cast expression chosen to > convert the source value to int in the two cases. > > The case where the source value has no quotes, the chosen cast > expression is a FuncExpr for function numeric_int4(), which has no way > to suppress errors. When the source value has quotes, the cast > expression is a CoerceViaIO expression, which can suppress the error. > The default behavior is to suppress the error and return NULL, so the > correct behavior is when the source value has quotes. > > I think we'll need either: > > * fix the code in 0001 to avoid getting numeric_int4() in this case, > and generally cast functions that don't have soft-error handling > support, in favor of using IO coercion. > * fix FuncExpr (like CoerceViaIO) to respect SQL/JSON's request to > suppress errors and fix downstream functions like numeric_int4() to > comply by handling errors softly. > > I'm inclined to go with the 1st option as we already have the > infrastructure in place -- input functions can all handle errors > softly. not sure this is the right way. but attached patches solved this problem. Also, can you share the previous memory-hogging bug issue when you are free, I want to know which part I am missing. v42-0001-minor-refactor-SQL-JSON-query-functions-based.no-cfbot Description: Binary data v42-0002-minor-refactor-JOSN_TABLE-functions-based-on-.no-cfbot Description: Binary data
Re: remaining sql/json patches
Hi, I was experimenting with the v42 patches, and I think the handling of ON EMPTY / ON ERROR clauses may need some improvement. The grammar is currently defined like this: | json_behavior ON EMPTY_P json_behavior ON ERROR_P This means the clauses have to be defined exactly in this order, and if someone does NULL ON ERROR NULL ON EMPTY it results in syntax error. I'm not sure what the SQL standard says about this, but it seems other databases don't agree on the order. Is there a particular reason to not allow both orderings? regards -- Tomas Vondra EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
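For concreteness, a sketch of the two spellings against the v42 grammar (JSON_VALUE shown; the behavior described is the patch's, not released PostgreSQL's):
```
-- Accepted: ON EMPTY must precede ON ERROR
SELECT JSON_VALUE(jsonb '{"a": 1}', '$.b' NULL ON EMPTY NULL ON ERROR);

-- Rejected with a syntax error under the v42 grammar, although some
-- other databases accept this ordering
SELECT JSON_VALUE(jsonb '{"a": 1}', '$.b' NULL ON ERROR NULL ON EMPTY);
```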
Re: remaining sql/json patches
On Thu, Mar 7, 2024 at 22:46 jian he wrote: > On Thu, Mar 7, 2024 at 8:06 PM Amit Langote > wrote: > > > > > > Indeed. > > > > This boils down to the difference in the cast expression chosen to > > convert the source value to int in the two cases. > > > > The case where the source value has no quotes, the chosen cast > > expression is a FuncExpr for function numeric_int4(), which has no way > > to suppress errors. When the source value has quotes, the cast > > expression is a CoerceViaIO expression, which can suppress the error. > > The default behavior is to suppress the error and return NULL, so the > > correct behavior is when the source value has quotes. > > > > I think we'll need either: > > > > * fix the code in 0001 to avoid getting numeric_int4() in this case, > > and generally cast functions that don't have soft-error handling > > support, in favor of using IO coercion. > > * fix FuncExpr (like CoerceViaIO) to respect SQL/JSON's request to > > suppress errors and fix downstream functions like numeric_int4() to > > comply by handling errors softly. > > > > I'm inclined to go with the 1st option as we already have the > > infrastructure in place -- input functions can all handle errors > > softly. > > not sure this is the right way. > but attached patches solved this problem. > > Also, can you share the previous memory-hogging bug issue > when you are free, I want to know which part I am missing. Take a look at the json_populate_type() call in ExecEvalJsonCoercion() or specifically compare the new way of passing its void *cache parameter with the earlier patches. >
Re: remaining sql/json patches
On 2024-Mar-07, Tomas Vondra wrote: > I was experimenting with the v42 patches, and I think the handling of ON > EMPTY / ON ERROR clauses may need some improvement. Well, the 2023 standard says things like ::= JSON_VALUE [ ] [ ON EMPTY ] [ ON ERROR ] which implies that if you specify it the other way around, it's a syntax error. > I'm not sure what the SQL standard says about this, but it seems other > databases don't agree on the order. Is there a particular reason to > not allow both orderings? I vaguely recall that trying to also support the other ordering leads to having more rules. Now maybe we do want that because of compatibility with other DBMSs, but frankly at this stage I wouldn't bother. -- Álvaro Herrera PostgreSQL Developer — https://www.EnterpriseDB.com/ "I am amazed at [the pgsql-sql] mailing list for the wonderful support, and lack of hesitasion in answering a lost soul's question, I just wished the rest of the mailing list could be like this." (Fotis) https://postgr.es/m/200606261359.k5qdxe2p004...@auth-smtp.hol.gr
Re: Add system identifier to backup manifest
On Wed, Mar 6, 2024 at 11:22 PM Amul Sul wrote: >> You are not changing silently the internals of get_controlfile(), so >> no objections here. The name of the new routine could be shorter, but >> being short of ideas what you are proposing looks fine by me. > > Could be get_controlfile_by_path() ? It could. I just thought this was clearer. I agree that it's a bit long, but I don't think this is worth bikeshedding very much. If at a later time somebody feels strongly that it needs to be changed, so be it. Right now, getting on with the business at hand is more important, IMHO. -- Robert Haas EDB: http://www.enterprisedb.com
Re: remaining sql/json patches
On Thu, Mar 7, 2024 at 23:14 Alvaro Herrera wrote: > On 2024-Mar-07, Tomas Vondra wrote: > > > I was experimenting with the v42 patches, and I think the handling of ON > > EMPTY / ON ERROR clauses may need some improvement. > > Well, the 2023 standard says things like > > ::= > JSON_VALUE > > [ ] > [ ON EMPTY ] > [ ON ERROR ] > > > which implies that if you specify it the other way around, it's a syntax > error. > > > I'm not sure what the SQL standard says about this, but it seems other > > databases don't agree on the order. Is there a particular reason to > > not allow both orderings? > > I vaguely recall that trying to also support the other ordering leads to > having more rules. Yeah, I think that was it. At one point, I removed rules supporting syntax that wasn’t documented. Now maybe we do want that because of compatibility > with other DBMSs, but frankly at this stage I wouldn't bother. +1. >
Re: POC, WIP: OR-clause support for indexes
Hi! On Tue, Mar 5, 2024 at 9:59 AM Andrei Lepikhov wrote: > On 5/3/2024 12:30, Andrei Lepikhov wrote: > > On 4/3/2024 09:26, jian he wrote: > ... and the new version of the patchset is attached. I made some revisions for the patchset. 1) Use hash_combine() to combine hash values. 2) Upper limit the number of array elements by MAX_SAOP_ARRAY_SIZE. 3) Better save the original order of clauses by putting hash entries and untransformable clauses to the same list. A lot of differences in regression tests output have gone. One important issue I found. # create table t as (select i::int%100 i from generate_series(1,1) i); # analyze t; # explain select * from t where i = 1 or i = 1; QUERY PLAN - Seq Scan on t (cost=0.00..189.00 rows=200 width=4) Filter: (i = ANY ('{1,1}'::integer[])) (2 rows) # set enable_or_transformation = false; SET # explain select * from t where i = 1 or i = 1; QUERY PLAN - Seq Scan on t (cost=0.00..189.00 rows=100 width=4) Filter: (i = 1) (2 rows) We don't make array values unique. That might make query execution performance somewhat worse, and also makes selectivity estimation worse. I suggest Andrei and/or Alena should implement making array values unique. -- Regards, Alexander Korotkov v20-0002-Teach-generate_bitmap_or_paths-to-build-BitmapOr.patch Description: Binary data v20-0001-Transform-OR-clauses-to-ANY-expression.patch Description: Binary data
Re: Dump-restore loosing 'attnotnull' bit for DEFERRABLE PRIMARY KEY column(s).
On 2024-Mar-07, Dean Rasheed wrote: > On Thu, 7 Mar 2024 at 13:00, Alvaro Herrera wrote: > > > > So I think the original developers of REPLICA IDENTITY had the right > > idea here (commit 07cacba983ef), and we mustn't change this aspect, > > because it'll lead to data corruption in replication. Using a deferred > > PK for DDL considerations seems OK, but it seems certain that for actual > > data replication it's going to be a disaster. > > Yes, that makes sense. If I understand correctly though, the > replication code uses relation->rd_replidindex (not > relation->rd_pkindex), although sometimes it's the same thing. So can > we get away with making sure that RelationGetIndexList() doesn't set > relation->rd_replidindex to a deferrable PK, while still allowing > relation->rd_pkindex to be one? Well, not really, because the logical replication code for some reason uses GetRelationIdentityOrPK(), which falls back to rd->pk_index (via RelationGetPrimaryKeyIndex) if rd_replindex is not set. Maybe we can add a flag RelationData->rd_ispkdeferred, so that RelationGetPrimaryKeyIndex returned InvalidOid for deferrable PKs; then logical replication would continue to not know about this PK, which perhaps is what we want. I'll do some testing with this. -- Álvaro HerreraBreisgau, Deutschland — https://www.EnterpriseDB.com/
Re: WIP Incremental JSON Parser
Some more observations as I make my way through the patch: In src/common/jsonapi.c, > +#define JSON_NUM_NONTERMINALS 6 Should this be 5 now? > + res = pg_parse_json_incremental(&(incstate->lex), &(incstate->sem), > + > chunk, size, is_last); > + > + expected = is_last ? JSON_SUCCESS : JSON_INCOMPLETE; > + > + if (res != expected) > + json_manifest_parse_failure(context, "parsing failed"); This leads to error messages like pg_verifybackup: error: could not parse backup manifest: parsing failed which I would imagine is going to lead to confused support requests in the event that someone does manage to corrupt their manifest. Can we make use of json_errdetail() and print the line and column numbers? Patch 0001 over at [1] has one approach to making json_errdetail() workable in frontend code. Top-level scalars like `false` or `12345` do not parse correctly if the chunk size is too small; instead json_errdetail() reports 'Token "" is invalid'. With small chunk sizes, json_errdetail() additionally segfaults on constructions like `[tru]` or `12zz`. For my local testing, I'm carrying the following diff in 001_test_json_parser_incremental.pl: > - ok($stdout =~ /SUCCESS/, "chunk size $size: test succeeds"); > - ok(!$stderr, "chunk size $size: no error output"); > + like($stdout, qr/SUCCESS/, "chunk size $size: test succeeds"); > + is($stderr, "", "chunk size $size: no error output"); This is particularly helpful when a test fails spuriously due to code coverage spray on stderr. Thanks, --Jacob [1] https://www.postgresql.org/message-id/CAOYmi%2BmSSY4SvOtVN7zLyUCQ4-RDkxkzmTuPEN%2Bt-PsB7GHnZA%40mail.gmail.com
Re: [PoC] Improve dead tuple storage for lazy vacuum
On Thu, Mar 7, 2024 at 8:06 PM John Naylor wrote: > > On Thu, Mar 7, 2024 at 4:47 PM Masahiko Sawada wrote: > > > > On Thu, Mar 7, 2024 at 6:37 PM John Naylor wrote: > > > > $ git grep 'link_with: pgport_srv' > > > src/test/modules/test_radixtree/meson.build: link_with: pgport_srv, > > > > > > No other test module uses this directive, and indeed, removing this > > > still builds fine for me. Thoughts? > > > > Yeah, it could be the culprit. The test_radixtree/meson.build is the > > sole extension that explicitly specifies a link with pgport_srv. I > > think we can get rid of it as I've also confirmed the build still fine > > even without it. > > olingo and grassquit have turned green, so that must have been it. Cool! I've attached the remaining patches for CI. I've made some minor changes in separate patches and drafted the commit message for tidstore patch. While reviewing the tidstore code, I thought that it would be more appropriate to place tidstore.c under src/backend/lib instead of src/backend/common/access since the tidstore is no longer implemented only for heap or other access methods, and it might also be used by executor nodes in the future. What do you think? Regards, -- Masahiko Sawada Amazon Web Services: https://aws.amazon.com v68-ART.tar.gz Description: GNU Zip compressed data
Re: Optimizing nbtree ScalarArrayOp execution, allowing multi-column ordered scans, skip scan
Hello, I am not up to date with the last version of patch but I did a regular benchmark with version 11 and typical issue we have at the moment and the result are still very very good. [1] In term of performance improvement the last proposals could be a real game changer for 2 of our biggest databases. We hope that Postgres 17 will contain those improvements. Kind regards, Benoit [1] - https://gist.github.com/benoittgt/ab72dc4cfedea2a0c6a5ee809d16e04d?permalink_comment_id=4972955#gistcomment-4972955 __ Benoit Tigeot Le jeu. 7 mars 2024 à 15:36, Peter Geoghegan a écrit : > On Tue, Jan 23, 2024 at 3:22 PM Peter Geoghegan wrote: > > I could include something less verbose, mentioning a theoretical risk > > to out-of-core amcanorder routines that coevolved with nbtree, > > inherited the same SAOP limitations, and then never got the same set > > of fixes. > > Attached is v11, which now says something like that in the commit > message. Other changes: > > * Fixed buggy sorting of arrays using cross-type ORDER procs, by > recognizing that we need to consistently use same-type ORDER procs for > sorting and merging the arrays during array preprocessing. > > Obviously, when we sort, we compare array elements to other array > elements (all of the same type). This is true independent of whether > the query itself happens to use a cross type operator/ORDER proc, so > we will need to do two separate ORDER proc lookups in cross-type > scenarios. > > * No longer subscript the ORDER proc used for array binary searches > using a scankey subscript. Now there is an additional indirection that > works even in the presence of multiple redundant scan keys that cannot > be detected as such due to a lack of appropriate cross-type support > within an opfamily. > > This was subtly buggy before now. Requires a little more coordination > between array preprocessing and standard/primitive index scan > preprocessing, which isn't ideal but seems unavoidable. > > * Lots of code polishing, especially within _bt_advance_array_keys(). > > While _bt_advance_array_keys() still works in pretty much exactly the > same way as it did back in v10, there are now better comments. > Including something about why its recursive call to itself is > guaranteed to use a low, fixed amount of stack space, verified using > an assertion. That addresses a concern held by Matthias. > > Outlook > === > > This patch is approaching being committable now. Current plan is to > commit this within the next few weeks. > > All that really remains now is to research how we might integrate this > work with the recently added continuescanPrechecked/haveFirstMatch > stuff from Alexander Korotkov, if at all. I've put that off until now > because it isn't particularly fundamental to what I'm doing here, and > seems optional. > > I would also like to do more performance validation. Things like the > parallel index scan code could stand to be revisited once again. Plus > I should think about the overhead of array preprocessing when > btrescan() is called many times, from a nested loop join -- I should > have something to say about that concern (raised by Heikki at one > point) before too long. > > -- > Peter Geoghegan >
Re: RangeTblEntry.inh vs. RTE_SUBQUERY
On 03.03.24 11:02, Peter Eisentraut wrote: On 29.02.24 19:14, Tom Lane wrote: Peter Eisentraut writes: In nodes/parsenodes.h, it says both This *must* be false for RTEs other than RTE_RELATION entries. Well, that's true in the parser ... and also puts it under Fields valid in all RTEs: which are both wrong on opposite ends of the spectrum. I think it would make more sense to group inh under "Fields valid for a plain relation RTE" and then explain the exception for subqueries, like it is done for several other fields. Dunno. The adjacent "lateral" field is also used for only selected RTE kinds. The section is /* * Fields valid in all RTEs: */ Alias *alias; /* user-written alias clause, if any */ Alias *eref; /* expanded reference names */ bool lateral; /* subquery, function, or values is LATERAL? */ bool inh; /* inheritance requested? */ bool inFromCl; /* present in FROM clause? */ List *securityQuals; /* security barrier quals to apply, if any */ According to my testing, lateral is used for RTE_RELATION, RTE_SUBQUERY, RTE_FUNCTION, RTE_TABLEFUNC, RTE_VALUES, which is 5 out of 9 possible. So I think it might be okay to relabel that section (in actuality or mentally) as "valid in several/many/most RTEs". But I'm not sure what reason there would be for having inh there, which is better described as "valid for RTE_RELATION, but also borrowed by RTE_SUBQUERY", which is pretty much exactly what is the case for relid, relkind, etc. I have committed the patches for this discussion. Related discussion will continue at https://www.postgresql.org/message-id/flat/4b27fc50-8cd6-46f5-ab20-88dbaadca...@eisentraut.org / https://commitfest.postgresql.org/47/4697/ .
Re: improve ssl error code, 2147483650
On 07/03/2024 02:12, David Zhang wrote: The SSL_CTX_load_verify_locations function in OpenSSL will return NULL if there is a system error, such as "No such file or directory" in this case: const char *ERR_reason_error_string(unsigned long e) { ERR_STRING_DATA d, *p = NULL; unsigned long l, r; if (!RUN_ONCE(&err_string_init, do_err_strings_init)) { return NULL; } /* * ERR_reason_error_string() can't safely return system error strings, * since openssl_strerror_r() needs a buffer for thread safety, and we * haven't got one that would serve any sensible purpose. */ if (ERR_SYSTEM_ERROR(e)) return NULL; That's pretty unfortunate. As typical with OpenSSL, this stuff is not very well documented, but I think we could do something like this in SSLerrmessage(): if (ERR_SYSTEM_ERROR(e)) errreason = strerror(ERR_GET_REASON(e)); ERR_SYSTEM_ERROR only exists in OpenSSL 3.0 and above, and the only documentation I could find was in this one obscure place in the man pages: https://www.openssl.org/docs/man3.2/man3/BIO_dgram_get_local_addr_enable.html. But as a best-effort thing, it would still be better than "SSL error code 2147483650". It would be better to perform a simple SSL file check before passing the SSL file to OpenSSL APIs so that the system error can be captured and a meaningful message provided to the end user. That feels pretty ugly. I agree it would catch most of the common mistakes in practice, so maybe we should just hold our noses and do it anyway, if the above ERR_SYSTEM_ERROR() method doesn't work. It's sad that we cannot pass a file descriptor or in-memory copy of the file contents to those functions. -- Heikki Linnakangas Neon (https://neon.tech)
Re: [PoC] Improve dead tuple storage for lazy vacuum
On Thu, Mar 7, 2024 at 8:06 PM John Naylor wrote: > > On Thu, Mar 7, 2024 at 4:47 PM Masahiko Sawada wrote: > > > > On Thu, Mar 7, 2024 at 6:37 PM John Naylor wrote: > > > > $ git grep 'link_with: pgport_srv' > > > src/test/modules/test_radixtree/meson.build: link_with: pgport_srv, > > > > > > No other test module uses this directive, and indeed, removing this > > > still builds fine for me. Thoughts? > > > > Yeah, it could be the culprit. The test_radixtree/meson.build is the > > sole extension that explicitly specifies a link with pgport_srv. I > > think we can get rid of it as I've also confirmed the build still fine > > even without it. > > olingo and grassquit have turned green, so that must have been it. fairywren is complaining another build failure: [1931/2156] "gcc" -o src/test/modules/test_radixtree/test_radixtree.dll src/test/modules/test_radixtree/test_radixtree.dll.p/win32ver.obj src/test/modules/test_radixtree/test_radixtree.dll.p/test_radixtree.c.obj "-Wl,--allow-shlib-undefined" "-shared" "-Wl,--start-group" "-Wl,--out-implib=src/test\\modules\\test_radixtree\\test_radixtree.dll.a" "-Wl,--stack,4194304" "-Wl,--allow-multiple-definition" "-Wl,--disable-auto-import" "-fvisibility=hidden" "C:/tools/nmsys64/home/pgrunner/bf/root/HEAD/pgsql.build/src/backend/libpostgres.exe.a" "-pthread" "C:/tools/nmsys64/ucrt64/bin/../lib/libssl.dll.a" "C:/tools/nmsys64/ucrt64/bin/../lib/libcrypto.dll.a" "C:/tools/nmsys64/ucrt64/bin/../lib/libz.dll.a" "-lws2_32" "-lm" "-lkernel32" "-luser32" "-lgdi32" "-lwinspool" "-lshell32" "-lole32" "-loleaut32" "-luuid" "-lcomdlg32" "-ladvapi32" "-Wl,--end-group" FAILED: src/test/modules/test_radixtree/test_radixtree.dll "gcc" -o src/test/modules/test_radixtree/test_radixtree.dll src/test/modules/test_radixtree/test_radixtree.dll.p/win32ver.obj src/test/modules/test_radixtree/test_radixtree.dll.p/test_radixtree.c.obj "-Wl,--allow-shlib-undefined" "-shared" "-Wl,--start-group" "-Wl,--out-implib=src/test\\modules\\test_radixtree\\test_radixtree.dll.a" "-Wl,--stack,4194304" "-Wl,--allow-multiple-definition" "-Wl,--disable-auto-import" "-fvisibility=hidden" "C:/tools/nmsys64/home/pgrunner/bf/root/HEAD/pgsql.build/src/backend/libpostgres.exe.a" "-pthread" "C:/tools/nmsys64/ucrt64/bin/../lib/libssl.dll.a" "C:/tools/nmsys64/ucrt64/bin/../lib/libcrypto.dll.a" "C:/tools/nmsys64/ucrt64/bin/../lib/libz.dll.a" "-lws2_32" "-lm" "-lkernel32" "-luser32" "-lgdi32" "-lwinspool" "-lshell32" "-lole32" "-loleaut32" "-luuid" "-lcomdlg32" "-ladvapi32" "-Wl,--end-group" C:/tools/nmsys64/ucrt64/bin/../lib/gcc/x86_64-w64-mingw32/12.2.0/../../../../x86_64-w64-mingw32/bin/ld.exe: src/test/modules/test_radixtree/test_radixtree.dll.p/test_radixtree.c.obj:test_radixtree:(.rdata$.refptr.pg_popcount64[.refptr.pg_popcount64]+0x0): undefined reference to `pg_popcount64' It looks like it requires a link with pgport_srv but I'm not sure. It seems that the recent commit 1f1d73a8b breaks CI, Windows - Server 2019, VS 2019 - Meson & ninja, too. Regards, [1] https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=fairywren&dt=2024-03-07%2012%3A53%3A20 -- Masahiko Sawada Amazon Web Services: https://aws.amazon.com
Re: Function and Procedure with same signature?
Hannu Krosing writes: > On Sat, Feb 10, 2024 at 12:38 AM Tom Lane wrote: >> Worth noting perhaps that this is actually required by the SQL >> standard: per spec, functions and procedures are both "routines" >> and share the same namespace, > Can you point me to a place in the standard where it requires all > kinds of ROUTINES to be using the same namespace ? [ digs around a bit... ] Well, the spec is vaguer about this than I thought. It is clear on one thing: 11.60 conformance rules include 2) Without Feature T341, “Overloading of SQL-invoked functions and SQL-invoked procedures”, conforming SQL language shall not contain a in which the schema identified by the explicit or implicit of the includes a routine descriptor whose routine name is . ("" means a CREATE FUNCTION or CREATE PROCEDURE statement.) That is, basic SQL doesn't allow you to have two routines with the same qualified name at all, whether they have different types and parameter lists or not. Now the above text is the entire definition of T341, and it doesn't say just what an implementation that claims feature T341 is expected to allow. Looking through the rest of 11.60, we find 9) u) The schema identified by the explicit or implicit of the shall not include a routine descriptor whose specific name is equivalent to or a user-defined type descriptor that includes a method specification descriptor whose specific name is equivalent to . which evidently is describing the base case (with no mention of T341). We also find 20) Case: a) If R is an SQL-invoked procedure, then S shall not include another SQL-invoked procedure whose is equivalent to RN and whose number of SQL parameters is PN. which is a weird halfway measure indeed, and there's no acknowledgement that this contradicts 9u. The other arm of the case is pretty impenetrable prose, but what it appears to be saying is that you can't create two functions of the same name and same number of parameters if that would create any call-time ambiguity, that is that a call could be written that might refer to either. So I guess these paras are meant to explain what should happen if T341 is implemented, but it's far from clear, and certainly their restrictions on overloading are much stronger than what we allow. If you look in 10.6 (which is what is referenced by 11.62 ) you find 4) If specifies ROUTINE, then there shall be exactly one SQL-invoked routine in the schema identified by SCN whose is RN such that, for all i, 1 (one) ≤ i ≤ the number of arguments, when the Syntax Rules of Subclause 9.25, “Data type identity”, are applied with the declared type of its i-th SQL parameter as TYPE1 and the i-th in the of MN as TYPE2, those Syntax Rules are satisfied. The identifies that SQL-invoked routine. It could be argued that this doesn't prohibit having both a function and a procedure with the same data type list, only that you can't write ROUTINE when trying to drop or alter one. But I think that's just motivated reasoning. The paragraphs for being FUNCTION or PROCEDURE are exactly like the above except they say "exactly one function" or "exactly one procedure". If you argue that this text means we must allow functions and procedures with the same parameter lists, then you are also arguing that we must allow multiple functions with the same parameter lists, and it's just the user's tough luck if they need to drop one of them. 
A related point is that our tolerance for overloading routine names isn't unlimited: we don't allow duplicate names with the same list of input parameters, even if the output parameters are different. This is not something that the SQL spec cares to address, but there are good ambiguity-avoidance reasons for it. I think limiting overloading so that a ROUTINE specification is unambiguous is also a good thing. I remain unexcited about changing our definition of this. "Oracle allows it" is not something that has any weight in my view: they have made a bunch of bad design decisions as well as good ones, and I think this is a bad one. regards, tom lane
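As a concrete illustration of the shared-namespace rule and the ambiguity overloading would create (object names are hypothetical):
```
CREATE FUNCTION do_work(x int) RETURNS int
    LANGUAGE sql AS 'SELECT $1';

-- Today this fails with an "already exists with same argument types"
-- error, because functions and procedures share one namespace in pg_proc:
CREATE PROCEDURE do_work(x int)
    LANGUAGE plpgsql AS $$ BEGIN NULL; END $$;

-- If both were allowed to coexist, a ROUTINE reference like this one
-- would become ambiguous:
DROP ROUTINE do_work(int);
```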
Re: un-revert the MAINTAIN privilege and the pg_maintain predefined role
On Tue, Mar 05, 2024 at 10:12:35AM -0600, Nathan Bossart wrote: > Thanks to Jeff's recent work with commits 2af07e2 and 59825d1, the issue > that led to the revert of the MAINTAIN privilege and the pg_maintain > predefined role (commit 151c22d) should now be resolved. Specifically, > there was a concern that roles with the MAINTAIN privilege could use > search_path tricks to run arbitrary code as the table owner. Jeff's work > prevents this by restricting search_path to a known safe value when running > maintenance commands. (This approach and others were discussed on the > lists quite extensively, and it was also brought up at the developer > meeting at FOSDEM [0] earlier this year.) > > Given this, I'd like to finally propose un-reverting MAINTAIN and > pg_maintain. I created a commitfest entry for this [1] a few weeks ago and > attached it to Jeff's search_path thread, but I figured it would be good to > create a dedicated thread for this, too. The attached patch is a straight > revert of commit 151c22d except for the following small changes: > > * The catversion bump has been removed for now. The catversion will need > to be bumped appropriately if/when this is committed. > > * The OID for the pg_maintain predefined role needed to be changed. The > original OID has been reused for something else since this feature was > reverted. > > * The change in AdjustUpgrade.pm needed to be updated to check for > "$old_version < 17" instead of "$old_version < 16". Given all of this code was previously reviewed and committed, I am planning to forge ahead and commit this early next week, provided no objections or additional feedback materialize. -- Nathan Bossart Amazon Web Services: https://aws.amazon.com
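For readers catching up on the thread, the user-facing surface being un-reverted is unchanged from the original commit (a sketch; the role and table names are hypothetical):
```
-- Allow a role to run VACUUM, ANALYZE, REINDEX, REFRESH MATERIALIZED VIEW,
-- CLUSTER, and LOCK TABLE on a single table:
GRANT MAINTAIN ON TABLE measurements TO maintenance_role;

-- Or grant the predefined role for those operations on all relations:
GRANT pg_maintain TO maintenance_role;
```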
Re: table inheritance versus column compression and storage settings
Hi Peter and Tom, > On Tue, Feb 20, 2024 at 3:51 PM Peter Eisentraut > wrote: > > > > > > I have reverted the patch for now (and re-opened the commitfest entry). > > > We should continue to work on this and see if we can at least try to get > > > the pg_dump test coverage suitable. > > > > > > The pg_dump problems arise because we throw an error when parents have conflicting compression and storage properties. The patch that got reverted changed this slightly by allowing a child to override its parents' properties even when they conflict. It still threw an error when the child didn't override and the parents conflicted. I guess MergeAttributes() raises an error when it encounters parents with conflicting properties because it cannot decide which of the conflicting properties the child should inherit. Instead, it could simply apply the DEFAULT properties when the parent properties conflict and the child doesn't override them. Thus, when compression conflicts, the child's compression would be set to the default, and when storage conflicts, it would be set to the type's default storage. The child's properties, when specified explicitly, would always override. This would solve all the pg_dump bugs we saw with the reverted patch, as well as the existing bug I reported earlier. This change would break backward compatibility, but I don't think anybody relies on an error being thrown when parent properties conflict. What do you think? -- Best Wishes, Ashutosh Bapat
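To make the proposal concrete, a sketch of the case in question (compression shown; storage behaves analogously, and lz4 assumes a build configured --with-lz4):
```
CREATE TABLE parent1 (val text COMPRESSION pglz);
CREATE TABLE parent2 (val text COMPRESSION lz4);

-- Today this errors out because the parents' compression methods conflict.
-- Under the proposal, it would succeed and child.val would simply get the
-- default compression method:
CREATE TABLE child () INHERITS (parent1, parent2);

-- An explicit specification on the child would always win, as with the
-- reverted patch:
CREATE TABLE child2 (val text COMPRESSION lz4) INHERITS (parent1, parent2);
```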
Re: autovectorize page checksum code included elsewhere
I don't think anything discussed in this thread is ready for v17, so I am going to punt it to v18. -- Nathan Bossart Amazon Web Services: https://aws.amazon.com
Re: [PATCH] LockAcquireExtended improvement
On Thu, Feb 8, 2024 at 5:28 AM Jingxian Li wrote: > Based on your comments above, I improve the commit message and comment for > InsertSelfIntoWaitQueue in new patch. Well, I had a look at this patch today, and even after reading the new commit message, I couldn't really convince myself that it was correct. It may well be entirely correct, but I simply find it hard to tell. It would help if the comments had been adjusted a bit more, e.g. /* Skip the wait and just grant myself the lock. */ - GrantLock(lock, proclock, lockmode); - GrantAwaitedLock(); return PROC_WAIT_STATUS_OK; Surely this is not an acceptable change. The comments says "and just grant myself the lock" but the code no longer does that. But instead of just complaining, I decided to try writing a version of the patch that seemed acceptable to me. Here it is. I took a different approach than you. Instead of splitting up ProcSleep(), I just passed down the dontWait flag to WaitOnLock() and ProcSleep(). In LockAcquireExtended(), I moved the existing code that handles giving up in the don't-wait case from before the call to ProcSleep() to afterward. As far as I can see, the major way this could be wrong is if calling ProcSleep() with dontWait = true and having it fail to acquire the lock changes the state in some way that makes the cleanup code that I moved incorrect. I don't *think* that's the case, but I might be wrong. What do you think of this version? -- Robert Haas EDB: http://www.enterprisedb.com v3-0001-Allow-a-no-wait-lock-acquisition-to-succeed-in-mo.patch Description: Binary data
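For context, the dontWait path being reworked here is (among other callers) what backs user-visible lock requests that must not block, for example:
```
-- Both of these ultimately ask LockAcquireExtended() not to sleep if the
-- lock cannot be granted immediately (table name is hypothetical):
LOCK TABLE accounts IN ACCESS EXCLUSIVE MODE NOWAIT;

SELECT * FROM accounts WHERE id = 42 FOR UPDATE NOWAIT;
```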
Re: speed up a logical replica setup
Hi, I decided to take a quick look on this patch today, to see how it works and do some simple tests. I've only started to get familiar with it, so I have only some comments / questions regarding usage, not on the code. It's quite possible I didn't understand some finer points, or maybe it was already discussed earlier in this very long thread, so please feel free to push back or point me to the past discussion. Also, some of this is rather opinionated, but considering I didn't see this patch before, my opinions may easily be wrong ... 1) SGML docs It seems the SGML docs are more about explaining how this works on the inside, rather than how to use the tool. Maybe that's intentional, but as someone who didn't work with pg_createsubscriber before I found it confusing and not very helpful. For example, the first half of the page is prerequisities+warning, and sure those are useful details, but prerequisities are checked by the tool (so I can't really miss this) and warnings go into a lot of details about different places where things may go wrong. Sure, worth knowing and including in the docs, but maybe not right at the beginning, before I learn how to even run the tool? Maybe that's just me, though. Also, I'm sure it's not the only part of our docs like this. Perhaps it'd be good to reorganize the content a bit to make the "how to use" stuff more prominent? 2) this is a bit vague ... pg_createsubscriber will check a few times if the connection has been reestablished to stream the required WAL. After a few attempts, it terminates with an error. What does "a few times" mean, and how many is "a few attempts"? Seems worth knowing when using this tool in environments where disconnections can happen. Maybe this should be configurable? 3) publication info For a while I was quite confused about which tables get replicated, until I realized the publication is FOR ALL TABLES. But I learned that from this thread, the docs say nothing about this. Surely that's an important detail that should be mentioned? 4) Is FOR ALL TABLES a good idea? I'm not sure FOR ALL TABLES is a good idea. Or said differently, I'm sure it won't work for a number of use cases. I know large databases it's common to create "work tables" (not necessarily temporary) as part of a batch job, but there's no need to replicate those tables. AFAIK that'd break this FOR ALL TABLES publication, because the tables will qualify for replication, but won't be present on the subscriber. Or did I miss something? I do understand that FOR ALL TABLES is the simplest approach, and for v1 it may be an acceptable limitation, but maybe it'd be good to also support restricting which tables should be replicated (e.g. blacklist or whitelist based on table/schema name?). BTW if I'm right and creating a table breaks the subscriber creation, maybe it'd be good to explicitly mention that in the docs. Note: I now realize this might fall under the warning about DDL, which says this: Executing DDL commands on the source server while running pg_createsubscriber is not recommended. If the target server has already been converted to logical replica, the DDL commands must not be replicated so an error would occur. But I find this confusing. Surely there are many DDL commands that have absolutely no impact on logical replication (like creating an index or view, various ALTER TABLE flavors, and so on). And running such DDL certainly does not trigger error, right? 
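For reference on point 4: the publication syntax needed for such a whitelist already exists in core; the open question is only whether pg_createsubscriber should expose it. A sketch, with hypothetical schema and table names:
```
-- Replicate only a known application schema instead of FOR ALL TABLES:
CREATE PUBLICATION createsub_pub FOR TABLES IN SCHEMA app;

-- Or, alternatively, an explicit whitelist of tables:
CREATE PUBLICATION createsub_pub FOR TABLE app.orders, app.customers;
```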
5) slot / publication / subscription name I find it somewhat annoying it's not possible to specify names for objects created by the tool - replication slots, publication and subscriptions. If this is meant to be a replica running for a while, after a while I'll have no idea what pg_createsubscriber_569853 or pg_createsubscriber_459548_2348239 was meant for. This is particularly annoying because renaming these objects later is either not supported at all (e.g. for replication slots), or may be quite difficult (e.g. publications). I do realize there are challenges with custom names (say, if there are multiple databases to replicate), but can't we support some simple formatting with basic placeholders? So we could specify --slot-name "myslot_%d_%p" or something like that? BTW what will happen if we convert multiple standbys? Can't they all get the same slot name (they all have the same database OID, and I'm not sure how much entropy the PID has)? regards -- Tomas Vondra EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
DOCS: add helpful partitioning links
This patch adds a link to the "attach partition" command section (similar to the detach partition link above it) as well as a link to "create table like" as both commands contain additional information that users should review beyond what is laid out in this section. There's also a couple of wordsmiths in nearby areas to improve readability. Robert Treat https://xzilla.net improve-partition-links.patch Description: Binary data
Re: Extension Enhancement: Buffer Invalidation in pg_buffercache
Quite an interesting patch, in my opinion. I've decided to work on it a bit: I did some refactoring (sorry) and added basic tests. I've also tried to take into account, as far as possible, the notes on the patch from Cédric Villemain. > and maybe better to go with FlushOneBuffer() ? It's a good idea, but I'm not sure at the moment; I'll try to dig deeper into it. At the very least, FlushOneBuffer() does not work for local buffers, so we have to decide whether pg_buffercache_invalidate should or should not work for local buffers. For now, I don't see why it should not. Maybe I'm missing something? -- Best regards, Maxim Orlov. v4-0001-Invalidate-Buffer-By-Bufnum.patch Description: Binary data
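A usage sketch, assuming the signature from the patch (one buffer id in, a boolean out indicating whether the buffer was actually evicted); the table name is hypothetical:
```
-- Evict every shared buffer currently caching blocks of my_table:
SELECT bufferid, pg_buffercache_invalidate(bufferid)
FROM pg_buffercache
WHERE relfilenode = pg_relation_filenode('my_table');
```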
Re: improve ssl error code, 2147483650
Greetings, * Heikki Linnakangas (hlinn...@iki.fi) wrote: > On 07/03/2024 02:12, David Zhang wrote: > > The SSL_CTX_load_verify_locations function in OpenSSL will return NULL > > if there is a system error, such as "No such file or directory" in this > > case: > > > > const char *ERR_reason_error_string(unsigned long e) > > { > > ERR_STRING_DATA d, *p = NULL; > > unsigned long l, r; > > > > if (!RUN_ONCE(&err_string_init, do_err_strings_init)) { > > return NULL; > > } > > > > /* > > * ERR_reason_error_string() can't safely return system error strings, > > * since openssl_strerror_r() needs a buffer for thread safety, and we > > * haven't got one that would serve any sensible purpose. > > */ > > if (ERR_SYSTEM_ERROR(e)) > > return NULL; > > That's pretty unfortunate. As typical with OpenSSL, this stuff is not very > well documented, but I think we could do something like this in > SSLerrmessage(): > > if (ERR_SYSTEM_ERROR(e)) > errreason = strerror(ERR_GET_REASON(e)); > > ERR_SYSTEM_ERROR only exists in OpenSSL 3.0 and above, and the only > documentation I could find was in this one obscure place in the man pages: > https://www.openssl.org/docs/man3.2/man3/BIO_dgram_get_local_addr_enable.html. > But as a best-effort thing, it would still be better than "SSL error code > 2147483650". Agreed that it doesn't seem well documented. I was trying to figure out what the 'right' answer here was myself and not having much success. If the above works, then +1 to that. > > It would be better to perform a simple SSL file check before passing the > > SSL file to OpenSSL APIs so that the system error can be captured and a > > meaningful message provided to the end user. > > That feels pretty ugly. I agree it would catch most of the common mistakes > in practice, so maybe we should just hold our noses and do it anyway, if the > above ERR_SYSTEM_ERROR() method doesn't work. Yeah, seems better to try and handle this the OpenSSL way ... if that's possible to do. > It's sad that we cannot pass a file descriptor or in-memory copy of the file > contents to those functions. Agreed. Thanks! Stephen signature.asc Description: PGP signature
Re: 035_standby_logical_decoding unbounded hang
On Thu, Mar 07, 2024 at 02:46:55PM +0500, Andrey M. Borodin wrote: > I’m not sure if it is connected, but so far many patches in CFbot keep > hanging in this test. For example [0]. > [0] https://cirrus-ci.com/task/5911176840740864?logs=check_world#L292 Relevant part: [22:03:10.292] stderr: [22:03:10.292] # poll_query_until timed out executing this query: [22:03:10.292] # SELECT (SELECT catalog_xmin::text::int - 770 from pg_catalog.pg_replication_slots where slot_name = 'injection_activeslot') > 0 [22:03:10.292] # expecting this output: [22:03:10.292] # t [22:03:10.292] # last actual query output: [22:03:10.292] # f [22:03:10.292] # with stderr: [22:03:10.292] # Tests were run but no plan was declared and done_testing() was not seen. [22:03:10.292] # Looks like your test exited with 255 just after 57. The injection_activeslot test got added after $SUBJECT, in 08a52ab151 (2024-03-06). It's now reverted in 65db0cfb4c (2024-03-07).
Re: Add last_commit_lsn to pg_stat_database
I've previously noted in "Add last commit LSN to pg_last_committed_xact()" [1] that it's not possible to monitor how many bytes of WAL a logical replication slot is behind (computing that is trivial for physical slots), because the slot doesn't need to replicate beyond the last commit. In some cases it's possible for the current WAL location to be far beyond the last commit. A few examples: - An idle server with checkpoint_timeout at a lower value (so lots of empty WAL advance). - Very large transactions: particularly insidious because committing a 1 GB transaction after a small transaction may show almost zero time lag even though quite a bit of data needs to be processed and sent over the wire (so time to replay is significantly different from current lag). - A cluster with multiple databases complicates matters further, because while physical replication is cluster-wide, the LSNs that matter for logical replication are database specific. Since we don't expose the most recent commit's LSN there's no way to say "the WAL is currently 1250, the last commit happened at 1000, the slot has flushed up to 800, therefore there are at most 200 bytes replication needs to read through to catch up". I'm not sure I fully understand the problem. What are you doing currently to measure the lag? If you look at pg_replication_slots today, confirmed_flush_lsn advances also when you do pg_switch_wal(), so looking at the diff between confirmed_flush_lsn and pg_current_wal_lsn() works, no? And on the other hand, even if you expose the database's last commit LSN, you can have a publication that includes only a subset of tables. Or commits that don't write to any table at all. So I'm not sure why the database's last commit LSN is relevant. Getting the last LSN that did something that needs to be replicated through the publication might be useful, but that's not what this patch does. -- Heikki Linnakangas Neon (https://neon.tech)
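To spell out the query Heikki is suggesting, using only existing functions and catalog columns:
```
-- Byte-based upper bound on how far each logical slot is behind the
-- current WAL insert position:
SELECT slot_name,
       pg_size_pretty(
           pg_wal_lsn_diff(pg_current_wal_lsn(), confirmed_flush_lsn)) AS bytes_behind
FROM pg_replication_slots
WHERE slot_type = 'logical';
```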
Re: Dump-restore loosing 'attnotnull' bit for DEFERRABLE PRIMARY KEY column(s).
On 2024-Mar-07, Alvaro Herrera wrote: > Maybe we can add a flag RelationData->rd_ispkdeferred, so that > RelationGetPrimaryKeyIndex returned InvalidOid for deferrable PKs; then > logical replication would continue to not know about this PK, which > perhaps is what we want. I'll do some testing with this. This seems to work okay. -- Álvaro Herrera 48°01'N 7°57'E — https://www.EnterpriseDB.com/ “Cuando no hay humildad las personas se degradan” (A. Christie) >From 8e3f913ca6a5bdbdbccac8d750ee9a6b6889f9aa Mon Sep 17 00:00:00 2001 From: Alvaro Herrera Date: Thu, 7 Mar 2024 16:44:35 +0100 Subject: [PATCH] Admit deferrable PKs into rd_pkindex, but flag them ... and don't return them as replica identity. --- src/backend/utils/cache/relcache.c| 26 ++- src/include/utils/rel.h | 1 + src/test/regress/expected/publication.out | 10 +++ .../regress/expected/replica_identity.out | 5 src/test/regress/sql/publication.sql | 9 +++ src/test/regress/sql/replica_identity.sql | 4 +++ 6 files changed, 48 insertions(+), 7 deletions(-) diff --git a/src/backend/utils/cache/relcache.c b/src/backend/utils/cache/relcache.c index 4d339ee795..0460796cf5 100644 --- a/src/backend/utils/cache/relcache.c +++ b/src/backend/utils/cache/relcache.c @@ -4761,6 +4761,7 @@ RelationGetIndexList(Relation relation) char replident = relation->rd_rel->relreplident; Oid pkeyIndex = InvalidOid; Oid candidateIndex = InvalidOid; + bool pkdeferred = false; MemoryContext oldcxt; /* Quick exit if we already computed the list. */ @@ -4802,12 +4803,12 @@ RelationGetIndexList(Relation relation) result = lappend_oid(result, index->indexrelid); /* - * Non-unique, non-immediate or predicate indexes aren't interesting - * for either oid indexes or replication identity indexes, so don't - * check them. + * Non-unique or predicate indexes aren't interesting for either oid + * indexes or replication identity indexes, so don't check them. + * Deferred ones are not useful for replication identity either; but + * we do include them if they are PKs. */ if (!index->indisunique || - !index->indimmediate || !heap_attisnull(htup, Anum_pg_index_indpred, NULL)) continue; @@ -4832,7 +4833,13 @@ RelationGetIndexList(Relation relation) if (index->indisprimary && (index->indisvalid || relation->rd_rel->relkind == RELKIND_PARTITIONED_TABLE)) + { pkeyIndex = index->indexrelid; + pkdeferred = !index->indimmediate; + } + + if (!index->indimmediate) + continue; if (!index->indisvalid) continue; @@ -4854,7 +4861,8 @@ RelationGetIndexList(Relation relation) oldlist = relation->rd_indexlist; relation->rd_indexlist = list_copy(result); relation->rd_pkindex = pkeyIndex; - if (replident == REPLICA_IDENTITY_DEFAULT && OidIsValid(pkeyIndex)) + relation->rd_ispkdeferred = pkdeferred; + if (replident == REPLICA_IDENTITY_DEFAULT && OidIsValid(pkeyIndex) && !pkdeferred) relation->rd_replidindex = pkeyIndex; else if (replident == REPLICA_IDENTITY_INDEX && OidIsValid(candidateIndex)) relation->rd_replidindex = candidateIndex; @@ -4957,7 +4965,8 @@ RelationGetStatExtList(Relation relation) /* * RelationGetPrimaryKeyIndex -- get OID of the relation's primary key index * - * Returns InvalidOid if there is no such index. + * Returns InvalidOid if there is no such index, or if the primary key is + * DEFERRABLE. */ Oid RelationGetPrimaryKeyIndex(Relation relation) @@ -4972,7 +4981,7 @@ RelationGetPrimaryKeyIndex(Relation relation) Assert(relation->rd_indexvalid); } - return relation->rd_pkindex; + return relation->rd_ispkdeferred ? 
InvalidOid : relation->rd_pkindex; } /* @@ -5198,6 +5207,9 @@ RelationGetIndexPredicate(Relation relation) * Attribute numbers are offset by FirstLowInvalidHeapAttributeNumber so that * we can include system attributes (e.g., OID) in the bitmap representation. * + * Deferred indexes are considered for the primary key, but not for replica + * identity. + * * Caller had better hold at least RowExclusiveLock on the target relation * to ensure it is safe (deadlock-free) for us to take locks on the relation's * indexes. Note that since the introduction of CREATE INDEX CONCURRENTLY, diff --git a/src/include/utils/rel.h b/src/include/utils/rel.h index b3ea2b2042..51f876b5b5 100644 --- a/src/include/utils/rel.h +++ b/src/include/utils/rel.h @@ -63,6 +63,7 @@ typedef struct RelationData bool rd_isvalid; /* relcache entry is valid */ bool rd_indexvalid; /* is rd_indexlist valid? (also rd_pkindex and * rd_replidindex) */ + bool rd_ispkdeferred; bool rd_statvalid; /* is rd_statlist valid? */ /*-- diff --git a/src/test/regress/expected/publication.out b/src/test/regress/expected/publication.out index 16361a91f9..0c5521d2aa 100644 --- a/src/test/regress/expected/publication.out +++ b/src/test/regress/expected/p
Re: Popcount optimization using AVX512
On Tue, Mar 05, 2024 at 04:52:23PM +, Amonson, Paul D wrote: > Noted I will try to do the "reduced" bottom-posting. I might slip up > occasionally because it's an Intel habit. No worries. > Is there a way to make Outlook do the leading ">" in a reply for the > previous message? I do not know, sorry. I personally use mutt for the lists. > BTW: Created the commit-fest submission. Thanks. I intend to provide a more detailed review shortly, as I am aiming to get this one committed for v17. -- Nathan Bossart Amazon Web Services: https://aws.amazon.com
Re: improve ssl error code, 2147483650
Stephen Frost writes: > * Heikki Linnakangas (hlinn...@iki.fi) wrote: >> That's pretty unfortunate. As typical with OpenSSL, this stuff is not very >> well documented, but I think we could do something like this in >> SSLerrmessage(): >> >> if (ERR_SYSTEM_ERROR(e)) >> errreason = strerror(ERR_GET_REASON(e)); >> >> ERR_SYSTEM_ERROR only exists in OpenSSL 3.0 and above, and the only >> documentation I could find was in this one obscure place in the man pages: >> https://www.openssl.org/docs/man3.2/man3/BIO_dgram_get_local_addr_enable.html. >> But as a best-effort thing, it would still be better than "SSL error code >> 2147483650". > Agreed that it doesn't seem well documented. I was trying to figure out > what the 'right' answer here was myself and not having much success. If > the above works, then +1 to that. My reaction as well --- I was just gearing up to test this idea, unless one of you are already on it? regards, tom lane
Re: Potential stack overflow in incremental base backup
On Wed, Mar 6, 2024 at 12:44 AM Thomas Munro wrote: > I'll have to move that sucker onto the heap (we banned C99 variable > length arrays and we don't use nonstandard alloca()), but I think the > coding in master is already a bit dodgy: that's 131072 * > sizeof(BlockNumber) = 512kB with default configure options, but > --with-segsize X could create a stack variable up to 16GB, > corresponding to segment size 32TB (meaning no segmentation at all). > That wouldn't work. Shouldn't we move it to the heap? See attached > draft patch. > > Even on the heap, 16GB is too much to assume we can allocate during a > base backup. I don't claim that's a real-world problem for > incremental backup right now in master, because I don't have any > evidence that anyone ever really uses --with-segsize (do they?), but > if we make it an initdb option it will be more popular and this will > become a problem. Hmm. I don't see a problem with moving it from the stack to the heap. I don't believe I had any particular reason for wanting it on the stack specifically. I'm not sure what to do about really large segment sizes. As long as the allocation here is < ~100MB, it's probably fine, both because (1) 100MB isn't an enormously large allocation these days and (2) if you have a good reason to increase the segment size by a factor of 256 or more, you're probably running on a big machine, and then you should definitely have 100MB to burn. However, a 32TB segment size is another thing altogether. We could avoid transposing relative block numbers to absolute block numbers whenever start_blkno is 0, but that doesn't do a whole lot when the segment size is 8TB or 16TB rather than 32TB; plus, in such cases, the relative block number array is also going to be gigantic. Maybe what we need to do is figure out how to dynamically size the arrays in such cases, so that you only make them as big as required for the file you actually have, rather than as big as the file could theoretically be. But I think that's really only necessary if we're actually going to get rid of the idea of segmented relations altogether, which I don't think is happening at least for v17, and maybe not ever. -- Robert Haas EDB: http://www.enterprisedb.com
Re: Popcount optimization using AVX512
On 2024-Mar-04, Amonson, Paul D wrote: > > -#define LARGE_OFF_T (((off_t) 1 << 62) - 1 + ((off_t) 1 << 62)) > > +#define LARGE_OFF_T ((((off_t) 1 << 31) << 31) - 1 + (((off_t) 1 << 31) << 31)) > > > > IME this means that the autoconf you are using has been patched. A > > quick search on the mailing lists seems to indicate that it might be > > specific to Debian [1]. > > I am not sure what the ask is here? I made changes to the > configure.ac and ran autoconf2.69 to get builds to succeed. Do you > have a separate feedback here? So what happens here is that autoconf-2.69 as shipped by Debian contains some patches on top of the one released by GNU. We use the latter, so if you run Debian's, then the generated configure script will contain the differences coming from Debian's version. Really, I don't think this is very important as a review point, because if the configure.ac file is changed in the patch, it's best for the committer to run autoconf on their own, using a pristine GNU autoconf; the configure file in the submitted patch is not relevant, only configure.ac matters. What committers do (or should do) is keep an install of autoconf-2.69 straight from GNU. -- Álvaro Herrera Breisgau, Deutschland — https://www.EnterpriseDB.com/
Re: Potential stack overflow in incremental base backup
On Wed, Mar 6, 2024 at 6:29 AM Alvaro Herrera wrote: > On 2024-Mar-06, Thomas Munro wrote: > > Even on the heap, 16GB is too much to assume we can allocate during a > > base backup. I don't claim that's a real-world problem for > > incremental backup right now in master, because I don't have any > > evidence that anyone ever really uses --with-segsize (do they?), but > > if we make it an initdb option it will be more popular and this will > > become a problem. Hmm. > > Would it work to use a radix tree from the patchset at > https://postgr.es/m/canwcazb43znrk03bzftnvrafhzngzh26sjc0ep-sj8+w20v...@mail.gmail.com > ? Probably not that much, because we actually send the array to the client very soon after we construct it: push_to_sink(sink, &checksum_ctx, &header_bytes_done, incremental_blocks, sizeof(BlockNumber) * num_incremental_blocks); This is hard to do without materializing the array somewhere, so I don't think an alternate representation is the way to go in this instance. -- Robert Haas EDB: http://www.enterprisedb.com
Re: why there is not VACUUM FULL CONCURRENTLY?
On 2024-Feb-16, Antonin Houska wrote: > BTW, I'm failing to understand why cluster_rel() has no argument of the > BufferAccessStrategy type. According to buffer/README, the criterion for using > a specific strategy is that the page "is unlikely to be needed again > soon". Specifically for cluster_rel(), the page will *definitely* not be used > again (unless the VACUUM FULL/CLUSTER command fails): BufferTag contains the > relation file number and the old relation file is eventually dropped. > > Am I missing anything? No, that's just an oversight. Access strategies are newer than that cluster code. -- Álvaro Herrera 48°01'N 7°57'E — https://www.EnterpriseDB.com/ "Most hackers will be perfectly comfortable conceptualizing users as entropy sources, so let's move on." (Nathaniel Smith) https://mail.gnu.org/archive/html/monotone-devel/2007-01/msg00080.html
Re: Popcount optimization using AVX512
On Thu, Mar 07, 2024 at 06:53:12PM +0100, Alvaro Herrera wrote: > Really, I don't think this is very important as a review point, because > if the configure.ac file is changed in the patch, it's best for the > committer to run autoconf on their own, using a pristine GNU autoconf; > the configure file in the submitted patch is not relevant, only > configure.ac matters. Agreed. I didn't intend for this to be a major review point, and I apologize for the extra noise. -- Nathan Bossart Amazon Web Services: https://aws.amazon.com
Re: improve ssl error code, 2147483650
David Zhang writes: > When configuring SSL on the Postgres server side with the following > information: > ssl = on > ssl_ca_file = 'root_ca.crt' > ssl_cert_file = 'server-cn-only.crt' > ssl_key_file = 'server-cn-only.key' > If a user makes a mistake, for example, accidentally using 'root_ca.crl' > instead of 'root_ca.crt', Postgres will report an error like the one below: > FATAL: could not load root certificate file "root_ca.crl": SSL error > code 2147483650 Interestingly, this works fine for me on RHEL8 (with openssl-1.1.1k): 2024-03-07 12:57:53.432 EST [547522] FATAL: F: could not load root certificate file "foo.bar": No such file or directory 2024-03-07 12:57:53.432 EST [547522] LOCATION: be_tls_init, be-secure-openssl.c:306 I do reproduce your problem on Fedora 39 with openssl-3.1.1. So this seems to be a regression on OpenSSL's part. Maybe they'll figure out how to fix it sometime; that seems to be another good argument for not pre-empting their error handling. regards, tom lane
Re: Revisiting {CREATE INDEX, REINDEX} CONCURRENTLY improvements
Hello! > I'm not a fan of this approach. Changing visibility and cleanup > semantics to only benefit R/CIC sounds like a pain to work with in > essentially all visibility-related code. I'd much rather have to deal > with another index AM, even if it takes more time: the changes in > semantics will be limited to a new plug in the index AM system and a > behaviour change in R/CIC, rather than behaviour that changes in all > visibility-checking code. Technically, this does not affect the visibility logic, only the cleanup semantics. All visibility-related code remains untouched. But yes, still an inelegant and a little strange-looking option. At the same time, perhaps it could be made more elegant - for example, by adding to ComputeXidHorizonsResult, as a first-class citizen, a list of blocks to clear for particular relations. > But regardless of second scan snapshots, I think we can worry about > that part at a later moment: The first scan phase is usually the most > expensive and takes the most time of all phases that hold snapshots, > and in the above discussion we agreed that we can already reduce the > time that a snapshot is held during that phase significantly. Sure, it > isn't great that we have to scan the table again with only a single > snapshot, but generally phase 2 doesn't have that much to do (except > when BRIN indexes are involved) so this is likely less of an issue. > And even if it is, we would still have reduced the number of > long-lived snapshots by half. Hmm, but it looks like we don't have the infrastructure to "update" the xmin that is propagated to the horizon after the first snapshot in a transaction is taken. One option I know of is to reuse the approach of d9d076222f5b94a85e0e318339cfc44b8f26022d [1]. But if this is the case, then there is no point in re-taking the snapshot during the first phase - just apply this "if" only for the first phase - and you're done. Do you know of a less hacky way? Or is this a reasonable way to go? [1]: https://github.com/postgres/postgres/commit/d9d076222f5b94a85e0e318339cfc44b8f26022d#diff-8879f0173be303070ab7931db7c757c96796d84402640b9e386a4150ed97b179R1779-R1793
Re: Add system identifier to backup manifest
On Thu, Mar 7, 2024 at 9:16 AM Robert Haas wrote: > It could. I just thought this was clearer. I agree that it's a bit > long, but I don't think this is worth bikeshedding very much. If at a > later time somebody feels strongly that it needs to be changed, so be > it. Right now, getting on with the business at hand is more important, > IMHO. Here's a new version of the patch set, rebased over my version of 0001 and with various other corrections: * Tidy up grammar in documentation. * In manifest_process_version, the test checked whether the manifest version == 1, but the comment talked about versions >= 2. Make the comment match the code. * In load_backup_manifest, avoid changing the existing placement of a variable declaration. * Rename verify_system_identifier to verify_control_file because if we were verifying multiple things about the control file we'd still want to only read it once. * Tweak the coding of verify_backup_file and verify_control_file to avoid repeated path construction. * Remove saw_system_identifier_field. This looks like it's trying to enforce a rule that the system identifier must immediately follow the version, but we don't insist on anything like that for files or wal ranges, so there seems to be no reason to do it here. * Remove bogus "unrecognized top-level field" test from 005_bad_manifest.pl. The JSON included here doesn't include any unrecognized top-level field, so the fact that we were getting that error message was wrong. After removing saw_system_identifier_field, we no longer get the wrong error message, so the test started failing. * Remove "expected system identifier" test from 005_bad_manifest.pl. This was basically a test that saw_system_identifier_field was working. * Header comment adjustment for json_manifest_finalize_system_identifier. The last sentence was cut-and-pasted from somewhere that it made sense to here, where it doesn't. There's only ever one system identifier. I plan to commit this, barring objections. -- Robert Haas EDB: http://www.enterprisedb.com v11-0001-Expose-new-function-get_controlfile_by_exact_pat.patch Description: Binary data v11-0002-Add-the-system-identifier-to-backup-manifests.patch Description: Binary data
Re: improve ssl error code, 2147483650
I wrote: > Stephen Frost writes: >> Agreed that it doesn't seem well documented. I was trying to figure out >> what the 'right' answer here was myself and not having much success. If >> the above works, then +1 to that. > My reaction as well --- I was just gearing up to test this idea, > unless one of you are already on it? I've confirmed that this: diff --git a/src/backend/libpq/be-secure-openssl.c b/src/backend/libpq/be-secure-openssl.c index e12b1cc9e3..47eee4b59d 100644 --- a/src/backend/libpq/be-secure-openssl.c +++ b/src/backend/libpq/be-secure-openssl.c @@ -1363,6 +1363,10 @@ SSLerrmessage(unsigned long ecode) errreason = ERR_reason_error_string(ecode); if (errreason != NULL) return errreason; +#ifdef ERR_SYSTEM_ERROR + if (ERR_SYSTEM_ERROR(ecode)) + return strerror(ERR_GET_REASON(ecode)); +#endif snprintf(errbuf, sizeof(errbuf), _("SSL error code %lu"), ecode); return errbuf; } seems to be enough to fix the problem on OpenSSL 3.1.1. The #ifdef is needed to avoid compile failure against OpenSSL 1.1.1 --- but that version doesn't have the problem, so we don't need to sweat. This could probably do with a comment, and we need to propagate the fix into libpq's copy of the function too. Barring objections, I'll take care of that and push it later today. regards, tom lane
Re: improve ssl error code, 2147483650
> On 7 Mar 2024, at 20:58, Tom Lane wrote: > > I wrote: >> Stephen Frost writes: >>> Agreed that it doesn't seem well documented. I was trying to figure out >>> what the 'right' answer here was myself and not having much success. If >>> the above works, then +1 to that. > >> My reaction as well --- I was just gearing up to test this idea, >> unless one of you are already on it? > > I've confirmed that this: > > diff --git a/src/backend/libpq/be-secure-openssl.c > b/src/backend/libpq/be-secure-openssl.c > index e12b1cc9e3..47eee4b59d 100644 > --- a/src/backend/libpq/be-secure-openssl.c > +++ b/src/backend/libpq/be-secure-openssl.c > @@ -1363,6 +1363,10 @@ SSLerrmessage(unsigned long ecode) > errreason = ERR_reason_error_string(ecode); > if (errreason != NULL) > return errreason; > +#ifdef ERR_SYSTEM_ERROR > + if (ERR_SYSTEM_ERROR(ecode)) > + return strerror(ERR_GET_REASON(ecode)); > +#endif > snprintf(errbuf, sizeof(errbuf), _("SSL error code %lu"), ecode); > return errbuf; > } > > seems to be enough to fix the problem on OpenSSL 3.1.1. The #ifdef > is needed to avoid compile failure against OpenSSL 1.1.1 --- but that > version doesn't have the problem, so we don't need to sweat. This was introduced in OpenSSL 3.0.0 so that makes sense. Pre-3.0.0 versions truncate system error codes that fall outside the range 1..127, reserving the rest for OpenSSL-specific errors. To capture the full possible range of system errors, the code is no longer truncated and the ERR_SYSTEM_FLAG flag is set instead, which can be tested for with the macro used here. > This could probably do with a comment, and we need to propagate > the fix into libpq's copy of the function too. Barring objections, > I'll take care of that and push it later today. LGTM. -- Daniel Gustafsson
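To make the arithmetic concrete, here is a small standalone sketch (not part of the patch) that decodes the reported value using the same macros; it assumes OpenSSL 3.x headers and a platform where errno 2 is ENOENT, in which case it prints "No such file or directory".

/*
 * Standalone illustration of the decoding above: 2147483650 is
 * 0x80000002, i.e. ERR_SYSTEM_FLAG plus errno 2 (ENOENT on typical
 * systems).  Requires OpenSSL >= 3.0 for ERR_SYSTEM_ERROR().
 */
#include <stdio.h>
#include <string.h>
#include <openssl/err.h>

int
main(void)
{
    unsigned long ecode = 2147483650UL;    /* value from the report */

#ifdef ERR_SYSTEM_ERROR
    if (ERR_SYSTEM_ERROR(ecode))
    {
        printf("%s\n", strerror(ERR_GET_REASON(ecode)));
        return 0;
    }
#endif
    printf("SSL error code %lu\n", ecode);
    return 0;
}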
Re: POC, WIP: OR-clause support for indexes
Hi! On 07.03.2024 17:51, Alexander Korotkov wrote: Hi! On Tue, Mar 5, 2024 at 9:59 AM Andrei Lepikhov wrote: > On 5/3/2024 12:30, Andrei Lepikhov wrote: > > On 4/3/2024 09:26, jian he wrote: > ... and the new version of the patchset is attached. I made some revisions for the patchset. 1) Use hash_combine() to combine hash values. 2) Upper limit the number of array elements by MAX_SAOP_ARRAY_SIZE. 3) Better save the original order of clauses by putting hash entries and untransformable clauses to the same list. A lot of differences in regression tests output have gone. Thank you for your changes. I agree with them. One important issue I found. # create table t as (select i::int%100 i from generate_series(1,1) i); # analyze t; # explain select * from t where i = 1 or i = 1; QUERY PLAN - Seq Scan on t (cost=0.00..189.00 rows=200 width=4) Filter: (i = ANY ('{1,1}'::integer[])) (2 rows) # set enable_or_transformation = false; SET # explain select * from t where i = 1 or i = 1; QUERY PLAN - Seq Scan on t (cost=0.00..189.00 rows=100 width=4) Filter: (i = 1) (2 rows) We don't make array values unique. That might make query execution performance somewhat worse, and also makes selectivity estimation worse. I suggest Andrei and/or Alena should implement making array values unique. I have corrected this and some spelling mistakes. The unique_any_elements_change.no-cfbot file contains changes. While I was correcting the test results caused by such changes, I noticed that the same behavior was when converting the IN expression, and this can be seen in the result of the regression test: EXPLAIN (COSTS OFF) SELECT unique2 FROM onek2 WHERE stringu1 IN ('A', 'A') AND (stringu1 = 'A' OR stringu1 = 'A'); QUERY PLAN --- Bitmap Heap Scan on onek2 Recheck Cond: (stringu1 < 'B'::name) Filter: ((stringu1 = ANY ('{A,A}'::name[])) AND (stringu1 = 'A'::name)) -> Bitmap Index Scan on onek2_u2_prtl (4 rows) -- Regards, Alena Rybakina Postgres Professional: http://www.postgrespro.com The Russian Postgres Company diff --git a/src/backend/parser/parse_expr.c b/src/backend/parser/parse_expr.c index baeb83e58b..e9813ef6ab 100644 --- a/src/backend/parser/parse_expr.c +++ b/src/backend/parser/parse_expr.c @@ -312,8 +312,8 @@ TransformOrExprToANY(ParseState *pstate, List *args, int location) if (unlikely(found)) { - entry->consts = lappend(entry->consts, const_expr); - entry->exprs = lappend(entry->exprs, orqual); + entry->consts = list_append_unique(entry->consts, const_expr); + entry->exprs = list_append_unique(entry->exprs, orqual); } else { @@ -352,7 +352,6 @@ TransformOrExprToANY(ParseState *pstate, List *args, int location) entry = (OrClauseGroupEntry *) lfirst(lc); Assert(list_length(entry->consts) > 0); - Assert(list_length(entry->exprs) == list_length(entry->consts)); if (list_length(entry->consts) == 1 || list_length(entry->consts) > MAX_SAOP_ARRAY_SIZE) diff --git a/src/test/regress/expected/create_index.out b/src/test/regress/expected/create_index.out index 7b721bac71..606a4399f9 100644 --- a/src/test/regress/expected/create_index.out +++ b/src/test/regress/expected/create_index.out @@ -1915,16 +1915,16 @@ SELECT count(*) FROM tenk1 EXPLAIN (COSTS OFF) SELECT count(*) FROM tenk1 WHERE hundred = 42 AND (thousand < 42 OR thousand < 99 OR 43 > thousand OR 42 > thousand); -QUERY PLAN --- + QUERY PLAN +--- Aggregate -> Bitmap Heap Scan on tenk1 - Recheck Cond: ((hundred = 42) AND (thousand < ANY ('{42,99,43,42}'::integer[]))) + Recheck Cond: ((hundred = 42) AND (thousand < ANY ('{42,99,43}'::integer[]))) -> BitmapAnd -> 
Bitmap Index Scan on tenk1_hundred Index Cond: (hundred = 42) -> Bitmap Index Scan on tenk1_thous_tenthous - Index Cond: (thousand < ANY ('{42,99,43,42}'::integer[])) + Index Cond: (thousand < ANY ('{42,99,43}'::integer[])) (8 rows) EXPLAIN (COSTS OFF) diff --git a/src/test/regress/expected/select.out b/src/test/regress/expected/select.out index 0ebaf002e8..f4f5493c43 100644 --- a/src/test/r
Re: Optimizing nbtree ScalarArrayOp execution, allowing multi-column ordered scans, skip scan
On Wed, Mar 6, 2024 at 4:46 PM Matthias van de Meent wrote: > On Wed, 6 Mar 2024 at 01:50, Peter Geoghegan wrote: > > I think that there is supposed to be a closing parenthesis here? So > > "... (such as those described in ") might > > perform...". > > Correct. > > > FWM, if that's what you meant. > > WFM, yes? Then we're in agreement on this. > > At one point Heikki suggested that I just get rid of > > BTScanOpaqueData.arrayKeyData (which has been there for as long as > > nbtree had native support for SAOPs), and use > > BTScanOpaqueData.keyData exclusively instead. I've finally got around > > to doing that now. > > I'm not sure if it was worth the reduced churn when the changes for > the primary optimization are already 150+kB in size; every "small" > addition increases the context needed to review the patch, and it's > already quite complex. I agree that the patch is quite complex, especially relative to its size. > > Note that we no longer need to have an independent representation of > > so->qual_okay for array keys (the convention of setting > > so->numArrayKeys to -1 for unsatisfiable array keys is no longer > > required). There is no longer any need for a separate pass to carry > > over the contents of BTScanOpaqueData.arrayKeyData to > > BTScanOpaqueData.keyData, which was confusing. > > I wasn't very confused by it, but sure. > > > Are you still interested in working directly on the preprocessing > > stuff? > > If you mean my proposed change to merging two equality AOPs, then yes. > I'll try to fit it in tomorrow with the rest of the review. I didn't just mean that stuff. I was also suggesting that you could join the project directly (not just as a reviewer). If you're interested, you could do general work on the preprocessing of arrays. Fancier array-specific preprocessing. For example, something that can transform this: select * from foo where a in (1,2,3) and a < 3; Into this: select * from foo where a in (1,2); Or, something that can just set qual_okay=false, given a query such as: select * from foo where a in (1,2,3) and a < 5; This is clearly doable by reusing the binary search code during preprocessing. The first example transformation could work via a binary search for the constant "3", followed by a memmove() to shrink the array in-place (plus the inequality itself would need to be fully eliminated). The second example would work a little like the array merging thing that the patch has already, except that there'd only be one array involved (there wouldn't be a pair of arrays). > > select * from multi_test where a in (1,99, 182, 183, 184) and a < 183; > > > > Maybe we need to do better with that. What do you think? > > Let me come back to that when I'm done reviewing the final part of nbtutils. Even if this case doesn't matter (which I now doubt), it's probably easier to just get it right than it would be to figure out if we can live without it. > > void > > _bt_start_array_keys(IndexScanDesc scan, ScanDirection dir) > > @@ -519,159 +888,1163 @@ _bt_start_array_keys(IndexScanDesc scan, > > ScanDirection dir) > > BTScanOpaque so = (BTScanOpaque) scan->opaque; > > inti; > > > > +Assert(so->numArrayKeys); > > +Assert(so->qual_ok); > > Has the requirement for a known scan direction been removed, or should > this still have an Assert(dir != NoMovementScanDirection)? I agree that such an assertion is worth having. Added that locally. > I'm not sure that comment is correct, at least it isn't as clear as > can be. Maybe something more in the line of the following? 
> +pstate.prechecked = false;/* prechecked didn't cover HIKEY */ I agree that that's a little better than what I had. > +++ b/src/backend/access/nbtree/nbtutils.c > > @@ -272,7 +319,32 @@ _bt_preprocess_array_keys(IndexScanDesc scan) > > +elemtype = cur->sk_subtype; > > +if (elemtype == InvalidOid) > > +elemtype = rel->rd_opcintype[cur->sk_attno - 1]; > > Should we Assert() that this elemtype matches the array element type > ARR_ELEMTYPE(arrayval) used to deconstruct the array? Yeah, good idea. > This is not "a" previous equality key, but "the" previous equality > operator array scan key. > Do we want to expend some additional cycles for detecting duplicate > equality array key types in interleaved order like =int[] =bigint[] > =int[]? I don't think it would be very expensive considering the > limited number of cross-type equality operators registered in default > PostgreSQL, so a simple loop that checks matching element types > starting at the first array key scankey for this attribute should > suffice. We could even sort the keys by element type if we wanted to > fix any O(n) issues for this type of behaviour (though this is > _extremely_ unlikely to become a performance issue). Yes, I think that we should probably have this (though likely wouldn't bother sorting the scan keys themselves). You'd need to look-up cross-type operators for this
Re: Optimizing nbtree ScalarArrayOp execution, allowing multi-column ordered scans, skip scan
On Wed, Mar 6, 2024 at 4:51 PM Matthias van de Meent wrote: > To clarify, what I mean here is that merging the changes of both the > SAOPs changes and the removal of arrayKeyData seems to increase the > patch size and increases the maximum complexity of each component > patch's review. Removing arrayKeyData probably makes the patch very slightly smaller, actually. But even if it's really the other way around, I'd still like to get rid of it as part of the same commit as everything else. It just makes sense that way. > Separate patches may make this more reviewable, or not, but no comment > was given on why it is better to merge the changes into a single > patch. Fair enough. Here's why: The original SAOP design (commit 9e8da0f7, "Teach btree to handle ScalarArrayOpExpr quals natively") added a layer of indirection between scan->keyData (input scan keys) and so->keyData (output scan keys): it added another scan key array, so->arrayKeyData. There was array-specific preprocessing in _bt_preprocess_array_keys, that happened before the first primitive index scan even began -- that transformed our true input scan keys (scan->keyData) into a copy of the array with limited amounts of array-specific preprocessing already performed (so->arrayKeyData). This made a certain amount of sense at the time, because _bt_preprocess_keys was intended to be called once per primitive index scan. Kind of like the inner side of a nested loop join's inner index scan, where we call _bt_preprocess_keys once per inner-side scan/btrescan call. (Actually, Tom's design has us call _bt_preprocess once per primitive index scan per btrescan call, which might matter in those rare cases where the inner side of a nestloop join had SAOP quals.) What I now propose to do is to just call _bt_preprocess_keys once per btrescan (actually, it's still called once per primitive index scan, but all calls after the first are just no-ops after v12 of the patch). This makes sense because SAOP array constants aren't like nestloop joins with an inner index scans, in one important way: we really can see everything up-front. We can see all of the array elements, and operate on whole arrays as necessary during preprocessing (e.g., performing the array merging thing I added to _bt_preprocess_array_keys). It's not like the next array element is only visible to prepocessing only after the outer side of a nestloop join runs, and next calls btrescan -- so why treat it like that? Conceptually, "WHERE a = 1" is almost the same thing as "WHERE a = any('{1,2}')", so why not use essentially the same approach to preprocessing in both cases? We don't need to copy the input keys into so->arrayKeyData, because the indirection (which is a bit like a fake nested loop join) doesn't buy us anything. v13 of the patch doesn't quite 100% eliminate so->arrayKeyData. While it removes the arrayKeyData field from the BTScanOpaqueData struct, we still use a temporary array (accessed via a pointer that's just a local variable) that's also called arrayKeyData. And that also stores array-preprocessing-only copies of the input scan keys. That happens within _bt_preprocess_keys. So we do "still need arrayKeyData", but we only need it for a brief period at the start of the index scan. It just doesn't make any sense to keep it around for longer than that, in a world where _bt_preprocess_keys "operates directly on arrays". That only made sense (a bit of sense) back when _bt_preprocess_keys was subordinate to _bt_preprocess_array_keys, but it's kinda the other way around now. 
We could probably even get rid of this remaining limited form of arrayKeyData, but that doesn't seem like it would add much. -- Peter Geoghegan
Re: Popcount optimization using AVX512
As promised... > +# Check for Intel AVX512 intrinsics to do POPCNT calculations. > +# > +PGAC_AVX512_POPCNT_INTRINSICS([]) > +if test x"$pgac_avx512_popcnt_intrinsics" != x"yes"; then > + PGAC_AVX512_POPCNT_INTRINSICS([-mavx512vpopcntdq -mavx512f]) > +fi > +AC_SUBST(CFLAGS_AVX512_POPCNT) I'm curious why we need both -mavx512vpopcntdq and -mavx512f. On my machine, -mavx512vpopcntdq alone is enough to pass this test, so if there are other instructions required that need -mavx512f, then we might need to expand the test. > 13 files changed, 657 insertions(+), 119 deletions(-) I still think it's worth breaking this change into at least 2 patches. In particular, I think there's an opportunity to do the refactoring into pg_popcnt_choose.c and pg_popcnt_x86_64_accel.c prior to adding the AVX512 stuff. These changes are likely straightforward, and getting them out of the way early would make it easier to focus on the more interesting changes. IMHO there are a lot of moving parts in this patch. > +#undef HAVE__GET_CPUID_COUNT > + > +/* Define to 1 if you have immintrin. */ > +#undef HAVE__IMMINTRIN Is this missing HAVE__CPUIDEX? > uint64 > -pg_popcount(const char *buf, int bytes) > +pg_popcount_slow(const char *buf, int bytes) > { > uint64 popcnt = 0; > > -#if SIZEOF_VOID_P >= 8 > +#if SIZEOF_VOID_P == 8 > /* Process in 64-bit chunks if the buffer is aligned. */ > if (buf == (const char *) TYPEALIGN(8, buf)) > { > @@ -311,7 +224,7 @@ pg_popcount(const char *buf, int bytes) > > buf = (const char *) words; > } > -#else > +#elif SIZEOF_VOID_P == 4 > /* Process in 32-bit chunks if the buffer is aligned. */ > if (buf == (const char *) TYPEALIGN(4, buf)) > { Apologies for harping on this, but I'm still not seeing the need for these SIZEOF_VOID_P changes. While it's unlikely that this makes any practical difference, I see no reason to more strictly check SIZEOF_VOID_P here. > +/* Process any remaining bytes */ > +while (bytes--) > +popcnt += pg_number_of_ones[(unsigned char) *buf++]; > +return popcnt; > +#else > +return pg_popcount_slow(buf, bytes); > +#endif /* USE_AVX512_CODE */ nitpick: Could we call pg_popcount_slow() in a common section for these "remaining bytes?" > +#if defined(_MSC_VER) > +pg_popcount_indirect = pg_popcount512_fast; > +#else > +pg_popcount = pg_popcount512_fast; > +#endif These _MSC_VER sections are interesting. I'm assuming this is the workaround for the MSVC linking issue you mentioned above. I haven't looked too closely, but I wonder if the CRC32C code (see src/include/port/pg_crc32c.h) is doing something different to avoid this issue. Upthread, Alvaro suggested a benchmark [0] that might be useful. I scanned through this thread and didn't see any recent benchmark results for the latest form of the patch. I think it's worth verifying that we are still seeing the expected improvements. [0] https://postgr.es/m/202402071953.5c4z7t6kl7ts%40alvherre.pgsql -- Nathan Bossart Amazon Web Services: https://aws.amazon.com
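For what it's worth, the dispatch-on-first-call pattern under discussion looks roughly like the sketch below; the names and the always-fallback "choose" step are placeholders, not the patch itself. A real version would probe CPUID/XGETBV and install an AVX-512 implementation when the hardware and OS support it.

/*
 * Rough sketch of function-pointer dispatch for pg_popcount(); names and
 * the trivial chooser are illustrative only.
 */
#include <stdint.h>

typedef uint64_t (*popcount_fn) (const char *buf, int bytes);

static uint64_t pg_popcount_choose(const char *buf, int bytes);
static uint64_t pg_popcount_slow(const char *buf, int bytes);

/* first call goes through the chooser, which installs the real target */
static popcount_fn pg_popcount_impl = pg_popcount_choose;

static uint64_t
pg_popcount_slow(const char *buf, int bytes)
{
    uint64_t    popcnt = 0;

    while (bytes--)
    {
        unsigned char c = (unsigned char) *buf++;

        while (c)               /* per-byte popcount, kept simple here */
        {
            popcnt += c & 1;
            c >>= 1;
        }
    }
    return popcnt;
}

static uint64_t
pg_popcount_choose(const char *buf, int bytes)
{
    /* placeholder: always fall back; real code checks CPU support here */
    pg_popcount_impl = pg_popcount_slow;
    return pg_popcount_impl(buf, bytes);
}

uint64_t
pg_popcount(const char *buf, int bytes)
{
    return pg_popcount_impl(buf, bytes);
}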
Re: postgres_fdw test timeouts
On Sun, Dec 10, 2023 at 12:00:01PM +0300, Alexander Lakhin wrote: > So I would not say that it's a dominant failure for now, and given that > 04a09ee lives in master only, maybe we can save two commits (Revert + > Revert of revert) while moving to a more persistent solution. I just checked in on this one to see whether we needed to create an "open item" for v17. While I'm not seeing the failures anymore, the patch that Alexander claimed should fix it [0] doesn't appear to have been committed, either. Perhaps this was fixed by something else... [0] https://postgr.es/m/CA%2BhUKGL0bikWSC2XW-zUgFWNVEpD_gEWXndi2PE5tWqmApkpZQ%40mail.gmail.com -- Nathan Bossart Amazon Web Services: https://aws.amazon.com
Re: pg_column_toast_chunk_id: a function to get a chunk ID of a TOASTed value
On Mon, Feb 05, 2024 at 04:28:23PM +0900, Yugo NAGATA wrote: > On Thu, 1 Feb 2024 17:59:56 +0800 > jian he wrote: >> v6 patch looks good. > > Thank you for your review and updating the status to RwC! I think this one needs a (pretty trivial) rebase. I spent a few minutes testing it out and looking at the code, and it seems generally reasonable to me. Do you think it's worth adding something like a pg_column_toast_num_chunks() function that returns the number of chunks for the TOASTed value, too? -- Nathan Bossart Amazon Web Services: https://aws.amazon.com
CREATE TABLE creates a composite type corresponding to the table row, which is and is not there
I could not find any explanation of the following behaviour in docs - Our documentation for CREATE TABLE says: CREATE TABLE also automatically creates a data type that represents the composite type corresponding to one row of the table. Therefore, tables cannot have the same name as any existing data type in the same schema. But these composite tables are only sometimes there hannuk=# CREATE TABLE pair(a int, b int); CREATE TABLE hannuk=# INSERT INTO pair VALUES(1,2); INSERT 0 1 hannuk=# select pg_typeof(p) from pair as p; pg_typeof --- pair hannuk=# select pg_typeof(pg_typeof(p)) from pair as p; pg_typeof --- regtype # first case where I can not use the table-defined type hannuk=# create table anoter_pair of pair; ERROR: type pair is not a composite type # the type definitely is there as promised hannuk=# create type pair as (a int, b int); ERROR: type "pair" already exists # and I can create similar type wit other name and use it to create table hannuk=# create type pair2 as (a int, b int); CREATE TYPE hannuk=# create table anoter_pair of pair2; CREATE TABLE # and i can even use it in LIKE hannuk=# CREATE TABLE pair3(like pair2); CREATE TABLE # the type is present in pg_type with type 'c' for Composite hannuk=# select typname, typtype from pg_type where typname = 'pair'; typname | typtype -+- pair| c (1 row) # and I can add comment to the type hannuk=# COMMENT ON TYPE pair is 'A Shroedingers type'; COMMENT # but \dT does not show it (second case) hannuk=# \dT pair List of data types Schema | Name | Description +--+- (0 rows) --- Hannu
Re: Make COPY format extendable: Extract COPY TO format implementations
Hi, In "Re: Make COPY format extendable: Extract COPY TO format implementations" on Thu, 7 Mar 2024 15:32:01 +0900, Michael Paquier wrote: > While on it, here are some profiles based on HEAD and v17 with the > previous tests (COPY TO /dev/null, COPY FROM data sent to the void). > ... > > So, in short, and that's not really a surprise, there is no effect > once we use the dispatching with the routines only when a format would > want to plug-in with the APIs, but a custom format would still have a > penalty of a few percents for both if bottlenecked on CPU. Thanks for sharing these profiles! I agree with you. This shows that the v17 approach doesn't affect the current text/csv/binary implementations. (The v17 approach just adds 2 new structs, Copy{From,To}Rountine, without changing the current text/csv/binary implementations.) Can we push the v17 patch and proceed following implementations? Could someone (especially a PostgreSQL committer) take a look at this for double-check? Thanks, -- kou
Re: improve ssl error code, 2147483650
Daniel Gustafsson writes: > On 7 Mar 2024, at 20:58, Tom Lane wrote: >> This could probably do with a comment, and we need to propagate >> the fix into libpq's copy of the function too. Barring objections, >> I'll take care of that and push it later today. > LGTM. Done so far as be-secure-openssl.c and fe-secure-openssl.c are concerned. But I noticed that src/common/cryptohash_openssl.c and src/common/hmac_openssl.c have their own, rather half-baked versions of SSLerrmessage. I didn't do anything about that in the initial patch, because it's not clear to me that those routines would ever see system-errno-based errors, plus their comments claim that returning NULL isn't terribly bad. But if we want to do something about it, I don't think that maintaining 3 copies of the code is the way to go. Maybe push be-secure-openssl.c's version into src/common? regards, tom lane
Re: Confine vacuum skip logic to lazy_scan_skip
On Wed, Mar 06, 2024 at 10:00:23PM -0500, Melanie Plageman wrote: > On Wed, Mar 06, 2024 at 09:55:21PM +0200, Heikki Linnakangas wrote: > > I made some further changes. I kept them as separate commits for easier > > review, see the commit messages for details. Any thoughts on those changes? > > I've given some inline feedback on most of the extra patches you added. > Short answer is they all seem fine to me except I have a reservations > about 0008 because of the number of blkno variables flying around. I > didn't have a chance to rebase these into my existing changes today, so > either I will do it tomorrow or, if you are feeling like you're on a > roll and want to do it, that also works! Attached v7 contains all of the changes that you suggested plus some additional cleanups here and there. > > I feel heap_vac_scan_get_next_block() function could use some love. Maybe > > just some rewording of the comments, or maybe some other refactoring; not > > sure. But I'm pretty happy with the function signature and how it's called. I've cleaned up the comments on heap_vac_scan_next_block() in the first couple patches (not so much in the streaming read user). Let me know if it addresses your feelings or if I should look for other things I could change. I will say that now all of the variable names are *very* long. I didn't want to remove the "state" from LVRelState->next_block_state. (In fact, I kind of miss the "get". But I had to draw the line somewhere.) I think without "state" in the name, next_block sounds too much like a function. Any ideas for shortening the names of next_block_state and its members or are you fine with them? > I was wondering if we should remove the "get" and just go with > heap_vac_scan_next_block(). I didn't do that originally because I didn't > want to imply that the next block was literally the sequentially next > block, but I think maybe I was overthinking it. > > Another idea is to call it heap_scan_vac_next_block() and then the order > of the words is more like the table AM functions that get the next block > (e.g. heapam_scan_bitmap_next_block()). Though maybe we don't want it to > be too similar to those since this isn't a table AM callback. I've done a version of this. > > From 27e431e8dc69bbf09d831cb1cf2903d16f177d74 Mon Sep 17 00:00:00 2001 > > From: Heikki Linnakangas > > Date: Wed, 6 Mar 2024 20:58:57 +0200 > > Subject: [PATCH v6 6/9] Move vmbuffer back to a local varible in > > lazy_scan_heap() > > > > It felt confusing that we passed around the current block, 'blkno', as > > an argument to lazy_scan_new_or_empty() and lazy_scan_prune(), but > > 'vmbuffer' was accessed directly in the 'scan_state'. > > > > It was also a bit vague, when exactly 'vmbuffer' was valid. Calling > > heap_vac_scan_get_next_block() set it, sometimes, to a buffer that > > might or might not contain the VM bit for 'blkno'. But other > > functions, like lazy_scan_prune(), assumed it to contain the correct > > buffer. That was fixed up visibilitymap_pin(). But clearly it was not > > "owned" by heap_vac_scan_get_next_block(), like the other 'scan_state' > > fields. > > > > I moved it back to a local variable, like it was. Maybe there would be > > even better ways to handle it, but at least this is not worse than > > what we have in master currently. > > I'm fine with this. 
I did it the way I did (grouping it with the > "next_unskippable_block" in the skip struct), because I think that this > vmbuffer is always the buffer containing the VM bit for the next > unskippable block -- which sometimes is the block returned by > heap_vac_scan_get_next_block() and sometimes isn't. > > I agree it might be best as a local variable but perhaps we could retain > the comment about it being the block of the VM containing the bit for the > next unskippable block. (Honestly, the whole thing is very confusing). In 0001-0004 I've stuck with only having the local variable vmbuffer in lazy_scan_heap(). In 0006 (introducing pass 1 vacuum streaming read user) I added a vmbuffer back to the next_block_state (while also keeping the local variable vmbuffer in lazy_scan_heap()). The vmbuffer in lazy_scan_heap() contains the block of the VM containing visi information for the next unskippable block or for the current block if its visi information happens to be in the same block of the VM as either 1) the next unskippable block or 2) the most recently processed heap block. Streaming read vacuum separates this visibility check in heap_vac_scan_next_block() from the main loop of lazy_scan_heap(), so we can't just use a local variable anymore. Now the local variable vmbuffer in lazy_scan_heap() will only already contain the block with the visi information for the to-be-processed block if it happens to be in the same VM block as the most recently processed heap block. That means potentially more VM fetches. However, by adding a vmbuffer to next_block_state, the callback may be able to avoid extra VM fetches from one i
Re: [PoC] Improve dead tuple storage for lazy vacuum
On Thu, Mar 7, 2024 at 11:15 PM Masahiko Sawada wrote: > > It looks like it requires a link with pgport_srv but I'm not sure. It > seems that the recent commit 1f1d73a8b breaks CI, Windows - Server > 2019, VS 2019 - Meson & ninja, too. Unfortunately, none of the Windows animals happened to run both after the initial commit and before removing the (seemingly useless on our daily platfoms) link. I'll confirm on my own CI branch in a few minutes.
Re: [PoC] Improve dead tuple storage for lazy vacuum
On Fri, Mar 8, 2024 at 10:04 AM John Naylor wrote: > > On Thu, Mar 7, 2024 at 11:15 PM Masahiko Sawada wrote: > > > > It looks like it requires a link with pgport_srv but I'm not sure. It > > seems that the recent commit 1f1d73a8b breaks CI, Windows - Server > > 2019, VS 2019 - Meson & ninja, too. > > Unfortunately, none of the Windows animals happened to run both after > the initial commit and before removing the (seemingly useless on our > daily platfoms) link. I'll confirm on my own CI branch in a few > minutes. Yesterday I've confirmed the something like the below fixes the problem happened in Windows CI: --- a/src/test/modules/test_radixtree/meson.build +++ b/src/test/modules/test_radixtree/meson.build @@ -12,6 +12,7 @@ endif test_radixtree = shared_module('test_radixtree', test_radixtree_sources, + link_with: host_system == 'windows' ? pgport_srv : [], kwargs: pg_test_mod_args, ) test_install_libs += test_radixtree But I'm not sure it's the right fix especially because I guess it could raise "AddressSanitizer: odr-violation" error on Windows. Regards, -- Masahiko Sawada Amazon Web Services: https://aws.amazon.com
Re: [PoC] Improve dead tuple storage for lazy vacuum
On Fri, Mar 8, 2024 at 8:09 AM Masahiko Sawada wrote: > > Yesterday I've confirmed the something like the below fixes the > problem happened in Windows CI: Glad you shared before I went and did it. > --- a/src/test/modules/test_radixtree/meson.build > +++ b/src/test/modules/test_radixtree/meson.build > @@ -12,6 +12,7 @@ endif > > test_radixtree = shared_module('test_radixtree', >test_radixtree_sources, > + link_with: host_system == 'windows' ? pgport_srv : [], I don't see any similar coding elsewhere, so that leaves me wondering if we're missing something. On the other hand, maybe no test modules use files in src/port ... >kwargs: pg_test_mod_args, > ) > test_install_libs += test_radixtree > > But I'm not sure it's the right fix especially because I guess it > could raise "AddressSanitizer: odr-violation" error on Windows. Well, it's now at zero definitions that it can see, so I imagine it's possible that adding the above would not cause more than one. In any case, we might not know since as far as I can tell the MSVC animals don't have address sanitizer. I'll look around some more, and if I don't get any revelations, I guess we should go with the above.
Re: Patch: Add parse_type Function
Hello Hackers, On Feb 25, 2024, at 13:00, David E. Wheeler wrote: >> postgres=# SELECT to_regtypemod('timestamp(-4)'); >> ERROR: syntax error at or near "-" >> LINE 1: SELECT to_regtypemod('timestamp(-4)'); >> ^ >> CONTEXT: invalid type name "timestamp(-4)" >> >> postgres=# SELECT to_regtypemod('text(-4)'); >> ERROR: type modifier is not allowed for type "text" > > Yeah, there was quite a bit of discussion of this issue back in September[1]. > >> This behaviour is mentioned in the documentation, so I'd say it is ok. > > This is my attempt to make it clearer that it can return an error, but I > don’t love the wording TBH. I’ve rebased the patch and, in an attempt to clarify this behavior, added a couple of examples to the docs for to_regtype. Updated patch attached. Best, David v10-0001-Add-to_regtypemod-SQL-function.patch Description: Binary data
Re: ALTER TABLE SET ACCESS METHOD on partitioned tables
On Mon, Mar 04, 2024 at 05:46:56PM +0900, Michael Paquier wrote: > > Since InvalidOid is already taken, I guess you might need to introduce a > > boolean flag, like set_relam, indicating that the statement has an > > ACCESS METHOD clause. > > Yes, I don't see an alternative. The default needs a different field > to be tracked down to the execution. The data structure you used (defaultAccessMethod) allows this, which is intended to be prohibited: postgres=# ALTER TABLE a SET access method default, SET access method default; ALTER TABLE As you wrote it, you pass the "defaultAccessMethod" bool to ATExecSetAccessMethodNoStorage(), which seems odd. Why not just pass the target amoid as newAccessMethod ? When I fooled with this on my side, I called it "chgAccessMethod" to follow "chgPersistence". I think "is default" isn't the right data structure. Attached a relative patch with my version. Also: I just realized that instead of adding a bool, we could test (tab->rewrite & AT_REWRITE_ACCESS_METHOD) != 0 +-- Default and AM set in in clause are the same, relam should be set. in in? -- Justin >From 8aca5e443ebb41b7de1ba18b519a33a97a41a897 Mon Sep 17 00:00:00 2001 From: Michael Paquier Date: Mon, 4 Mar 2024 17:42:04 +0900 Subject: [PATCH 1/2] Allow specifying access method of partitioned tables ..to be inherited by partitions See also: ca4103025dfe26eaaf6a500dec9170fbb176eebc 8586bf7ed8889f39a59dd99b292014b73be85342 ebfe2dbd6b624e2a428e14b7ee9322cc096f63f7 - prevent DROP AM Authors: Justin Pryzby, Soumyadeep Chakraborty --- doc/src/sgml/catalogs.sgml | 5 +- doc/src/sgml/ref/alter_table.sgml | 8 ++ doc/src/sgml/ref/create_table.sgml | 4 + src/backend/commands/tablecmds.c| 183 +--- src/backend/utils/cache/lsyscache.c | 22 +++ src/backend/utils/cache/relcache.c | 7 + src/bin/pg_dump/pg_dump.c | 3 +- src/bin/pg_dump/t/002_pg_dump.pl| 35 + src/include/catalog/pg_class.h | 4 +- src/include/utils/lsyscache.h | 1 + src/test/regress/expected/create_am.out | 158 +++- src/test/regress/sql/create_am.sql | 91 +++- 12 files changed, 487 insertions(+), 34 deletions(-) diff --git a/doc/src/sgml/catalogs.sgml b/doc/src/sgml/catalogs.sgml index 0ae97d1adaa..e4fcacb610a 100644 --- a/doc/src/sgml/catalogs.sgml +++ b/doc/src/sgml/catalogs.sgml @@ -1989,8 +1989,9 @@ SCRAM-SHA-256$:&l If this is a table or an index, the access method used (heap, - B-tree, hash, etc.); otherwise zero (zero occurs for sequences, - as well as relations without storage, such as views) + B-tree, hash, etc.); otherwise zero. Zero occurs for sequences, + as well as relations without storage, such as views. Partitioned + tables may use a non-zero value. diff --git a/doc/src/sgml/ref/alter_table.sgml b/doc/src/sgml/ref/alter_table.sgml index 96e3d776051..779c8b31226 100644 --- a/doc/src/sgml/ref/alter_table.sgml +++ b/doc/src/sgml/ref/alter_table.sgml @@ -737,6 +737,14 @@ WITH ( MODULUS numeric_literal, REM DEFAULT changes the access method of the table to . + + When applied to a partitioned table, there is no data to rewrite, but any + partitions created afterwards will use that access method unless + overridden by a USING clause. Using + DEFAULT switches the partitioned table to rely on + the value of default_table_access_method set + when new partitions are created, instead. 
+ diff --git a/doc/src/sgml/ref/create_table.sgml b/doc/src/sgml/ref/create_table.sgml index 4cbaaccaf7c..b79081a5ec1 100644 --- a/doc/src/sgml/ref/create_table.sgml +++ b/doc/src/sgml/ref/create_table.sgml @@ -1330,6 +1330,10 @@ WITH ( MODULUS numeric_literal, REM method is chosen for the new table. See for more information. + + When creating a partition, the table access method is the access method + of its partitioned table, if set. + diff --git a/src/backend/commands/tablecmds.c b/src/backend/commands/tablecmds.c index 7014be80396..cf595329b6c 100644 --- a/src/backend/commands/tablecmds.c +++ b/src/backend/commands/tablecmds.c @@ -184,6 +184,7 @@ typedef struct AlteredTableInfo bool verify_new_notnull; /* T if we should recheck NOT NULL */ int rewrite; /* Reason for forced rewrite, if any */ Oid newAccessMethod; /* new access method; 0 means no change */ + bool defaultAccessMethod; /* true if SET ACCESS METHOD DEFAULT */ Oid newTableSpace; /* new tablespace; 0 means no change */ bool chgPersistence; /* T if SET LOGGED/UNLOGGED is used */ char newrelpersistence; /* if above is true */ @@ -589,6 +590,8 @@ static ObjectAddress ATExecClusterOn(Relation rel, const char *indexName, LOCKMODE lockmode); static void ATExecDropCluster(Relation rel, LOCKMODE lockmode); static v
Re: Invalid query generated by postgres_fdw with UNION ALL and ORDER BY
On Fri, 8 Mar 2024 at 00:54, Ashutosh Bapat wrote: > > On Thu, Mar 7, 2024 at 4:39 PM David Rowley wrote: >> I think the fix should go in appendOrderByClause(). It's at that >> point we look for the EquivalenceMember for the relation and can >> easily discover if the em_expr is a Const. I think we can safely just >> skip doing any ORDER BY stuff and not worry about if the >> literal format of the const will appear as a reference to an ordinal >> column position in the ORDER BY clause. > > deparseSortGroupClause() calls deparseConst() with showtype = 1. > appendOrderByClause() may want to do something similar for consistency. Or > remove it from deparseSortGroupClause() as well? The fix could also be to use deparseConst() in appendOrderByClause() and have that handle Const EquivalenceMember instead. I'd rather just skip them. To me, that seems less risky than ensuring deparseConst() handles all Const types correctly. Also, as far as adjusting GROUP BY to eliminate Consts, I don't think that's material for a bug fix. If we want to consider doing that, that's for master only. >> I wonder if we need a test... > > Yes. I've added two of those in the attached. I also changed the way the delimiter stuff works as the exiting code seems to want to avoid having a bool flag to record if we're adding the first item. The change I'm making means the bool flag is now required, so we may as well use that flag to deal with the delimiter append too. David diff --git a/contrib/postgres_fdw/deparse.c b/contrib/postgres_fdw/deparse.c index 8fc66fa11c..a30f00179e 100644 --- a/contrib/postgres_fdw/deparse.c +++ b/contrib/postgres_fdw/deparse.c @@ -3912,13 +3912,12 @@ appendOrderByClause(List *pathkeys, bool has_final_sort, { ListCell *lcell; int nestlevel; - const char *delim = " "; StringInfo buf = context->buf; + boolgotone = false; /* Make sure any constants in the exprs are printed portably */ nestlevel = set_transmission_modes(); - appendStringInfoString(buf, " ORDER BY"); foreach(lcell, pathkeys) { PathKey*pathkey = lfirst(lcell); @@ -3951,6 +3950,25 @@ appendOrderByClause(List *pathkeys, bool has_final_sort, em_expr = em->em_expr; + /* +* If the member is a Const expression then we needn't add it to the +* ORDER BY clause. This can happen in UNION ALL queries where the +* union child targetlist has a Const. Adding these would be wasteful, +* but also, for INT columns, an integer literal will be seen as an +* ordinal column position rather than a value to sort by, so we must +* skip these. +*/ + if (IsA(em_expr, Const)) + continue; + + if (!gotone) + { + appendStringInfoString(buf, " ORDER BY "); + gotone = true; + } + else + appendStringInfoString(buf, ", "); + /* * Lookup the operator corresponding to the strategy in the opclass. 
* The datatype used by the opfamily is not necessarily the same as @@ -3965,7 +3983,6 @@ appendOrderByClause(List *pathkeys, bool has_final_sort, pathkey->pk_strategy, em->em_datatype, em->em_datatype, pathkey->pk_opfamily); - appendStringInfoString(buf, delim); deparseExpr(em_expr, context); /* @@ -3975,7 +3992,6 @@ appendOrderByClause(List *pathkeys, bool has_final_sort, appendOrderBySuffix(oprid, exprType((Node *) em_expr), pathkey->pk_nulls_first, context); - delim = ", "; } reset_transmission_modes(nestlevel); } diff --git a/contrib/postgres_fdw/expected/postgres_fdw.out b/contrib/postgres_fdw/expected/postgres_fdw.out index c355e8f3f7..f3339b4ac8 100644 --- a/contrib/postgres_fdw/expected/postgres_fdw.out +++ b/contrib/postgres_fdw/expected/postgres_fdw.out @@ -919,6 +919,44 @@ EXPLAIN (VERBOSE, COSTS OFF) Remote SQL: SELECT "C 1", c2, c3, c4, c5, c6, c7, c8 FROM "S 1"."T 1" (6 rows) +-- Ensure we don't push ORDER BY expressions which are Consts at the UNION +-- child level to the foreign server. +EXPLAIN (VERBOSE, COSTS OFF) +SELECT * FROM ( +SELECT 1 AS type,c1 FROM ft1 +UNION ALL +SELECT 2 AS type,c1 FROM ft2 +) ORDER BY type,c1; + QUERY PLAN +- + Merge Append + Sort Key: (1), ft1.c1 + -> Foreign Scan on public.ft1 + Output: 1, ft1.c1 + Remote SQL: SELECT "C 1" FROM "
Re: [PoC] Improve dead tuple storage for lazy vacuum
On Fri, Mar 8, 2024 at 8:09 AM Masahiko Sawada wrote: > Yesterday I've confirmed the something like the below fixes the > problem happened in Windows CI: > > --- a/src/test/modules/test_radixtree/meson.build > +++ b/src/test/modules/test_radixtree/meson.build > @@ -12,6 +12,7 @@ endif > > test_radixtree = shared_module('test_radixtree', >test_radixtree_sources, > + link_with: host_system == 'windows' ? pgport_srv : [], >kwargs: pg_test_mod_args, > ) > test_install_libs += test_radixtree pgport_srv is for backend, shared libraries should be using pgport_shlib Further, the top level meson.build has: # all shared libraries not part of the backend should depend on this frontend_shlib_code = declare_dependency( include_directories: [postgres_inc], link_with: [common_shlib, pgport_shlib], sources: generated_headers, dependencies: [shlib_code, os_deps, libintl], ) ...but the only things that declare needing frontend_shlib_code are in src/interfaces/. In any case, I'm trying it in CI branch with pgport_shlib now.
Re: [PoC] Improve dead tuple storage for lazy vacuum
On Fri, Mar 8, 2024 at 9:53 AM John Naylor wrote: > > On Fri, Mar 8, 2024 at 8:09 AM Masahiko Sawada wrote: > > Yesterday I've confirmed the something like the below fixes the > > problem happened in Windows CI: > > > > --- a/src/test/modules/test_radixtree/meson.build > > +++ b/src/test/modules/test_radixtree/meson.build > > @@ -12,6 +12,7 @@ endif > > > > test_radixtree = shared_module('test_radixtree', > >test_radixtree_sources, > > + link_with: host_system == 'windows' ? pgport_srv : [], > >kwargs: pg_test_mod_args, > > ) > > test_install_libs += test_radixtree > > pgport_srv is for backend, shared libraries should be using pgport_shlib > In any case, I'm trying it in CI branch with pgport_shlib now. That seems to work, so I'll push that just to get things green again.
Re: Properly pathify the union planner
On Fri, 8 Mar 2024 at 00:39, Richard Guo wrote: > I would like to have another look, but it might take several days. > Would that be too late? Please look. Several days is fine. I'd like to start looking on Monday or Tuesday next week. Thanks David
Re: Synchronizing slots from primary to standby
On Thu, Mar 7, 2024 at 12:00 PM Zhijie Hou (Fujitsu) wrote: > > > Attach the V108 patch set which addressed above and Peter's comments. > I also removed the check for "*" in guc check hook. > Pushed with minor modifications. I'll keep an eye on BF. BTW, one thing that we should try to evaluate a bit more is the traversal of slots in StandbySlotsHaveCaughtup() where we verify if all the slots mentioned in standby_slot_names have received the required WAL. Even if the standby_slot_names list is short, the total number of slots can be much larger, which can lead to an increase in CPU usage during traversal. There is an optimization that allows us to cache ss_oldest_flush_lsn and ensures that we don't need to traverse the slots each time, so it may not be hit frequently, but there is still a chance. I see it is possible to further optimize this area by caching the position of each slot mentioned in standby_slot_names in the replication_slots array, but I am not sure whether it is worth it. -- With Regards, Amit Kapila.
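To make the caching idea concrete, here is a toy sketch; the structures below are simplified stand-ins, not the real ReplicationSlot shared-memory data structures. Each name in standby_slot_names is resolved to its array index once, and later calls only revalidate and check the cached entries.

/*
 * Toy sketch of caching slot positions for standby_slot_names checks.
 * ToySlot and the fixed-size arrays are illustrative stand-ins only.
 */
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

#define MAX_SLOTS           128
#define MAX_STANDBY_SLOTS   8

typedef struct ToySlot
{
    char        name[64];
    bool        in_use;
    uint64_t    flush_lsn;
} ToySlot;

static ToySlot slots[MAX_SLOTS];

/* cached array positions of the named standby slots; -1 = not built yet */
static int  cached_idx[MAX_STANDBY_SLOTS];
static int  ncached = -1;

static bool
standby_slots_have_caught_up(char *const *names, int nnames,
                             uint64_t wait_lsn)
{
    if (ncached < 0)
    {
        /* one full scan over all slots builds the cache */
        ncached = 0;
        for (int n = 0; n < nnames; n++)
            for (int i = 0; i < MAX_SLOTS; i++)
                if (slots[i].in_use && strcmp(slots[i].name, names[n]) == 0)
                    cached_idx[ncached++] = i;
    }

    /* later calls touch only the few cached entries */
    for (int n = 0; n < ncached; n++)
    {
        ToySlot    *s = &slots[cached_idx[n]];

        /* a dropped/recreated slot would require rebuilding the cache */
        if (!s->in_use || s->flush_lsn < wait_lsn)
            return false;
    }
    return true;
}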
Re: WIP Incremental JSON Parser
On 2024-03-07 Th 10:28, Jacob Champion wrote: Some more observations as I make my way through the patch: In src/common/jsonapi.c, +#define JSON_NUM_NONTERMINALS 6 Should this be 5 now? Yep. + res = pg_parse_json_incremental(&(incstate->lex), &(incstate->sem), + chunk, size, is_last); + + expected = is_last ? JSON_SUCCESS : JSON_INCOMPLETE; + + if (res != expected) + json_manifest_parse_failure(context, "parsing failed"); This leads to error messages like pg_verifybackup: error: could not parse backup manifest: parsing failed which I would imagine is going to lead to confused support requests in the event that someone does manage to corrupt their manifest. Can we make use of json_errdetail() and print the line and column numbers? Patch 0001 over at [1] has one approach to making json_errdetail() workable in frontend code. Looks sound on a first look. Maybe we should get that pushed ASAP so we can take advantage of it. Top-level scalars like `false` or `12345` do not parse correctly if the chunk size is too small; instead json_errdetail() reports 'Token "" is invalid'. With small chunk sizes, json_errdetail() additionally segfaults on constructions like `[tru]` or `12zz`. Ugh. Will investigate. For my local testing, I'm carrying the following diff in 001_test_json_parser_incremental.pl: - ok($stdout =~ /SUCCESS/, "chunk size $size: test succeeds"); - ok(!$stderr, "chunk size $size: no error output"); + like($stdout, qr/SUCCESS/, "chunk size $size: test succeeds"); + is($stderr, "", "chunk size $size: no error output"); This is particularly helpful when a test fails spuriously due to code coverage spray on stderr. Makes sense, thanks. I'll have a fresh patch set soon which will also take care of the bitrot. cheers andrew -- Andrew Dunstan EDB: https://www.enterprisedb.com
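In case it helps while poking at the small-chunk failures, a minimal driver for feeding input in fixed-size chunks might look like the sketch below. It leans on the pg_parse_json_incremental() call shape quoted earlier in the thread (an in-progress API, so details may differ), and the wrapper name is made up.

/*
 * Sketch of a chunked-parse driver, based on the incremental API as used
 * above: pg_parse_json_incremental(lex, sem, chunk, size, is_last).
 */
#include "postgres_fe.h"
#include "common/jsonapi.h"

static bool
parse_in_chunks(JsonLexContext *lex, JsonSemAction *sem,
                char *json, size_t len, size_t chunk_size)
{
    size_t      off = 0;

    while (off < len)
    {
        size_t      n = (chunk_size < len - off) ? chunk_size : len - off;
        bool        is_last = (off + n == len);
        JsonParseErrorType res;

        res = pg_parse_json_incremental(lex, sem, json + off, n, is_last);

        /* partial chunks should report "incomplete", the final one success */
        if (res != (is_last ? JSON_SUCCESS : JSON_INCOMPLETE))
            return false;       /* caller can report json_errdetail(res, lex) */

        off += n;
    }
    return true;
}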
Re: Improve eviction algorithm in ReorderBuffer
On Thu, Mar 7, 2024 at 2:16 PM Masahiko Sawada wrote: > > On Tue, Mar 5, 2024 at 3:28 PM Peter Smith wrote: > > > > 4a. > > The comment in simplehash.h says > > * The following parameters are only relevant when SH_DEFINE is defined: > > * - SH_KEY - ... > > * - SH_EQUAL(table, a, b) - ... > > * - SH_HASH_KEY(table, key) - ... > > * - SH_STORE_HASH - ... > > * - SH_GET_HASH(tb, a) - ... > > > > So maybe it is nicer to reorder the #defines in that same order? > > > > SUGGESTION: > > +#define SH_PREFIX bh_nodeidx > > +#define SH_ELEMENT_TYPE bh_nodeidx_entry > > +#define SH_KEY_TYPE bh_node_type > > +#define SH_SCOPE extern > > +#ifdef FRONTEND > > +#define SH_RAW_ALLOCATOR pg_malloc0 > > +#endif > > > > +#define SH_DEFINE > > +#define SH_KEY key > > +#define SH_EQUAL(tb, a, b) (memcmp(&a, &b, sizeof(bh_node_type)) == 0) > > +#define SH_HASH_KEY(tb, key) \ > > + hash_bytes((const unsigned char *) &key, sizeof(bh_node_type)) > > +#include "lib/simplehash.h" > > I'm really not sure it helps increase readability. For instance, for > me it's readable if SH_DEFINE and SH_DECLARE come to the last before > #include since it's more obvious whether we want to declare, define or > both. Other simplehash.h users also do so. > OK. > > 5. > > + * > > + * If 'indexed' is true, we create a hash table to track of each node's > > + * index in the heap, enabling to perform some operations such as removing > > + * the node from the heap. > > */ > > binaryheap * > > -binaryheap_allocate(int capacity, binaryheap_comparator compare, void *arg) > > +binaryheap_allocate(int capacity, binaryheap_comparator compare, > > + bool indexed, void *arg) > > > > BEFORE > > ... enabling to perform some operations such as removing the node from the > > heap. > > > > SUGGESTION > > ... to help make operations such as removing nodes more efficient. > > > > But these operations literally require the indexed binary heap as we > have an assertion: > > void > binaryheap_remove_node_ptr(binaryheap *heap, bh_node_type d) > { > bh_nodeidx_entry *ent; > > Assert(!binaryheap_empty(heap) && heap->bh_has_heap_property); > Assert(heap->bh_indexed); > I didn’t quite understand -- the operations mentioned are "operations such as removing the node", but binaryheap_remove_node() also removes a node from the heap. So I still felt the comment wording of the patch is not quite correct. Now, if the removal of a node from an indexed heap can *only* be done using binaryheap_remove_node_ptr() then: - the other removal functions (binaryheap_remove_*) probably need some comments to make sure nobody is tempted to call them directly for an indexed heap. - maybe some refactoring and assertions are needed to ensure those *cannot* be called directly for an indexed heap. > > 7b. > > IMO the parameters would be better the other way around (e.g. 'index' > > before the 'node') because that's what the assignments look like: > > > > > > heap->bh_nodes[heap->bh_size] = d; > > > > becomes: > > bh_set_node(heap, heap->bh_size, d); > > > > I think it assumes heap->bh_nodes is an array. But if we change it in > the future, it will no longer make sense. I think it would make more > sense if we define the parameters in an order like "we set the 'node' > at 'index'". What do you think? YMMV. The patch code is also OK by me if you prefer it. // And, here are some review comments for v8-0002. == 1. delete_nodeidx +/* + * Remove the node's index from the hash table if the heap is indexed. 
+ */ +static bool +delete_nodeidx(binaryheap *heap, bh_node_type node) +{ + if (!binaryheap_indexed(heap)) + return false; + + return bh_nodeidx_delete(heap->bh_nodeidx, node); +} 1a. In v8 this function was changed to now return bool, so, I think the function comment should explain the meaning of that return value. ~ 1b. I felt the function body is better expressed positively: "If this then do that", instead of "If not this then do nothing otherwise do that" SUGGESTION if (binaryheap_indexed(heap)) return bh_nodeidx_delete(heap->bh_nodeidx, node); return false; ~~~ 2. +static void +replace_node(binaryheap *heap, int index, bh_node_type new_node) +{ + bool found PG_USED_FOR_ASSERTS_ONLY; + + /* Quick return if not necessary to move */ + if (heap->bh_nodes[index] == new_node) + return; + + /* + * Remove overwritten node's index. The overwritten node's position must + * have been tracked, if enabled. + */ + found = delete_nodeidx(heap, heap->bh_nodes[index]); + Assert(!binaryheap_indexed(heap) || found); + + /* + * Replace it with the given new node. This node's position must also be + * tracked as we assume to replace the node by the existing node. + */ + found = set_node(heap, new_node, index); + Assert(!binaryheap_indexed(heap) || found); +} 2a. /Remove overwritten/Remove the overwritten/ /replace the node by the existing node/replace the node with the existing node/ ~ 2b. It might be helpful to declare another local var... bh_node_type
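To make the indexed-heap mechanics under review easier to follow, here is a conceptual sketch (not the patch text) of how the bh_nodeidx hash table lets a removal skip the linear scan over bh_nodes[], and how the mapping is kept in sync when a node is stored at a position. The 'index' field on bh_nodeidx_entry and the exact generated helper names are assumptions on my part.

#include "postgres.h"
#include "lib/binaryheap.h"

/*
 * Remove 'node' from an indexed heap.  Instead of scanning bh_nodes[] for
 * the node, consult the hash table that maps a node to its current array
 * position.  (In the patch, binaryheap_remove_node() itself keeps the
 * mapping up to date for any nodes it moves while sifting.)
 */
static void
remove_node_via_index(binaryheap *heap, bh_node_type node)
{
    bh_nodeidx_entry *ent;

    Assert(binaryheap_indexed(heap));

    /* O(1) lookup of the node's position in bh_nodes[]. */
    ent = bh_nodeidx_lookup(heap->bh_nodeidx, node);
    Assert(ent != NULL);

    /* Remove at that position; sifting restores the heap property. */
    binaryheap_remove_node(heap, ent->index);
}

/*
 * Keep the mapping in sync whenever a node is (re)stored at a position,
 * as set_node() does in the patch: insert-or-update the entry for 'node'.
 */
static void
track_node_position(binaryheap *heap, bh_node_type node, int index)
{
    bool        found;
    bh_nodeidx_entry *ent;

    ent = bh_nodeidx_insert(heap->bh_nodeidx, node, &found);
    ent->index = index;
}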
Re: remaining sql/json patches
I looked at the documentation again. One more change for JSON_QUERY:

diff --git a/doc/src/sgml/func.sgml b/doc/src/sgml/func.sgml
index 3e58ebd2..0c49b321 100644
--- a/doc/src/sgml/func.sgml
+++ b/doc/src/sgml/func.sgml
@@ -18715,8 +18715,8 @@ ERROR: jsonpath array subscript is out of bounds
 be of type jsonb.
-The ON EMPTY clause specifies the behavior if the
-path_expression yields no value at all; the
+The ON EMPTY clause specifies the behavior if applying the
+path_expression to the context_item yields no value at all; the
 default when ON EMPTY is not specified is to return a null value.
Re: CREATE TABLE creates a composite type corresponding to the table row, which is and is not there
On 2024-03-08 01:12 +0100, Hannu Krosing wrote:
> I could not find any explanation of the following behaviour in docs -
>
> Our documentation for CREATE TABLE says:
>
> CREATE TABLE also automatically creates a data type that represents
> the composite type corresponding to one row of the table. Therefore,
> tables cannot have the same name as any existing data type in the same
> schema.
>
> But these composite tables are only sometimes there

There's a distinction between stand-alone composite types created with
CREATE TYPE and those created implicitly via CREATE TABLE. The former is
also called "free-standing" in the docs for pg_type.typrelid[1].

> hannuk=# CREATE TABLE pair(a int, b int);
> CREATE TABLE
>
> hannuk=# INSERT INTO pair VALUES(1,2);
> INSERT 0 1
>
> hannuk=# select pg_typeof(p) from pair as p;
>  pg_typeof
> -----------
>  pair
>
> hannuk=# select pg_typeof(pg_typeof(p)) from pair as p;
>  pg_typeof
> -----------
>  regtype
>
> # first case where I can not use the table-defined type
>
> hannuk=# create table anoter_pair of pair;
> ERROR: type pair is not a composite type

That error message is simply misleading. What gets checked here is that
type "pair" was created with CREATE TYPE. The attached patch fixes the
error message and also documents that requirement.

check_of_type() already addresses this limitation:

/*
 * check_of_type
 *
 * Check whether a type is suitable for CREATE TABLE OF/ALTER TABLE OF.  If it
 * isn't suitable, throw an error.  Currently, we require that the type
 * originated with CREATE TYPE AS.  We could support any row type, but doing so
 * would require handling a number of extra corner cases in the DDL commands.
 * (Also, allowing domain-over-composite would open up a can of worms about
 * whether and how the domain's constraints should apply to derived tables.)
 */

Not sure what those corner cases are, but table inheritance is one of them:

I played around with typeOk in check_of_type() to also accept the composite
types implicitly created by CREATE TABLE:

    typeOk = (typeRelation->rd_rel->relkind == RELKIND_COMPOSITE_TYPE ||
              typeRelation->rd_rel->relkind == RELKIND_RELATION);

With that creating typed tables of parent and child works as expected:

CREATE TABLE parent (a int);
CREATE TABLE child (b int) INHERITS (parent);
CREATE TABLE of_parent OF parent;
CREATE TABLE of_child OF child;

\d parent
              Table "public.parent"
 Column |  Type   | Collation | Nullable | Default
--------+---------+-----------+----------+---------
 a      | integer |           |          |
Number of child tables: 1 (Use \d+ to list them.)

\d of_parent
             Table "public.of_parent"
 Column |  Type   | Collation | Nullable | Default
--------+---------+-----------+----------+---------
 a      | integer |           |          |
Typed table of type: parent

\d child
              Table "public.child"
 Column |  Type   | Collation | Nullable | Default
--------+---------+-----------+----------+---------
 a      | integer |           |          |
 b      | integer |           |          |
Inherits: parent

\d of_child
             Table "public.of_child"
 Column |  Type   | Collation | Nullable | Default
--------+---------+-----------+----------+---------
 a      | integer |           |          |
 b      | integer |           |          |
Typed table of type: child

But adding columns to parent does not change the typed tables:

ALTER TABLE parent ADD c int;

\d parent
              Table "public.parent"
 Column |  Type   | Collation | Nullable | Default
--------+---------+-----------+----------+---------
 a      | integer |           |          |
 c      | integer |           |          |
Number of child tables: 1 (Use \d+ to list them.)
\d of_parent
             Table "public.of_parent"
 Column |  Type   | Collation | Nullable | Default
--------+---------+-----------+----------+---------
 a      | integer |           |          |
Typed table of type: parent

\d child
              Table "public.child"
 Column |  Type   | Collation | Nullable | Default
--------+---------+-----------+----------+---------
 a      | integer |           |          |
 b      | integer |           |          |
 c      | integer |           |          |
Inherits: parent

\d of_child
             Table "public.of_child"
 Column |  Type   | Collation | Nullable | Default
Re: CREATE TABLE creates a composite type corresponding to the table row, which is and is not there
I wrote: > The attached patch fixes the error message and also documents that > requirement. On second thought, adding a separate error message doesn't really make sense. The attached v2 is a simpler patch that instead modifies the existing error message. -- Erik
Re: Synchronizing slots from primary to standby
On Fri, Mar 8, 2024 at 2:33 PM Amit Kapila wrote:
> On Thu, Mar 7, 2024 at 12:00 PM Zhijie Hou (Fujitsu) wrote:
> >
> > Attach the V108 patch set which addressed above and Peter's comments.
> > I also removed the check for "*" in guc check hook.
> >
> Pushed with minor modifications. I'll keep an eye on BF.
>
> BTW, one thing that we should try to evaluate a bit more is the
> traversal of slots in StandbySlotsHaveCaughtup() where we verify if
> all the slots mentioned in standby_slot_names have received the
> required WAL. Even if the standby_slot_names list is short the total
> number of slots can be much larger which can lead to an increase in
> CPU usage during traversal. There is an optimization that allows to
> cache ss_oldest_flush_lsn and ensures that we don't need to traverse
> the slots each time so it may not hit frequently but still there is a
> chance. I see it is possible to further optimize this area by caching
> the position of each slot mentioned in standby_slot_names in
> replication_slots array but not sure whether it is worth.
>

I tried to test this by configuring a large number of logical slots, while
making sure the standby slots are at the end of the array, and checking
whether these searches caused any performance hit in logical replication.

Setup:
1. 1 primary server configured with 3 servers in standby_slot_names, 1 extra
logical slot (not configured for failover) + 1 logical subscriber configured
as failover + 3 physical standbys (all configured to sync logical slots)
2. Same as above, but with 100 extra logical slots (not configured for
failover)
3. Same as above, but with 500 extra logical slots (not configured for
failover)

In the three setups, the 3 standby_slot_names are compared with lists of 2,
101, and 501 slots respectively. I ran pgbench for 15 minutes for all 3
setups:

Case 1: Average TPS - 8.143399 TPS
Case 2: Average TPS - 8.187462 TPS
Case 3: Average TPS - 8.190611 TPS

I see no degradation in performance; the differences are well within the
run-to-run variation seen.

Nisha also did some performance tests to record the lag introduced by
traversing a large number of slots in StandbySlotsHaveCaughtup(). The tests
logged the time at the start and end of the XLogSendLogical() call (which
eventually calls WalSndWaitForWal() --> StandbySlotsHaveCaughtup()) and
calculated the total time taken by this function during the load run for
different total slot counts.

Setup:
-- one primary with 3 standbys and one subscriber with one active
subscription
-- hot_standby_feedback=off and sync_replication_slots=false
-- made sure the standby slots remain at the end of the
ReplicationSlotCtl->replication_slots array, to measure the worst-case
scenario for the standby slot search in StandbySlotsHaveCaughtup()

pgbench was run for 15 min. Here is the data:

Case 1: with 1 logical slot, standby_slot_names having 3 slots
Run1: 626.141642 secs
Run2: 631.930254 secs

Case 2: with 100 logical slots, standby_slot_names having 3 slots
Run1: 629.38332 secs
Run2: 630.548432 secs

Case 3: with 500 logical slots, standby_slot_names having 3 slots
Run1: 629.910829 secs
Run2: 627.924183 secs

There was no degradation in performance seen.
Thanks Nisha for helping with the testing. regards, Ajin Cherian Fujitsu Australia
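Regarding the position-caching idea quoted at the top of this mail, a rough sketch of what it could look like is below. The StandbySlotPos struct, its field names, and the assumption that the caller holds ReplicationSlotControlLock are purely illustrative on my part; this is not a benchmarked proposal.

#include "postgres.h"
#include "replication/slot.h"

/* Hypothetical per-name cache of where a standby slot sits in the array. */
typedef struct StandbySlotPos
{
    char        name[NAMEDATALEN];
    int         cached_index;   /* index into replication_slots[], or -1 */
} StandbySlotPos;

/*
 * Find the named slot, trying the cached position first and falling back
 * to the usual linear scan on a miss.  Caller is assumed to hold
 * ReplicationSlotControlLock in shared mode.
 */
static ReplicationSlot *
lookup_standby_slot(StandbySlotPos *pos)
{
    if (pos->cached_index >= 0)
    {
        ReplicationSlot *slot =
            &ReplicationSlotCtl->replication_slots[pos->cached_index];

        if (slot->in_use &&
            strcmp(NameStr(slot->data.name), pos->name) == 0)
            return slot;        /* cache hit, no traversal needed */
    }

    for (int i = 0; i < max_replication_slots; i++)
    {
        ReplicationSlot *slot = &ReplicationSlotCtl->replication_slots[i];

        if (slot->in_use &&
            strcmp(NameStr(slot->data.name), pos->name) == 0)
        {
            pos->cached_index = i;      /* remember for next time */
            return slot;
        }
    }

    pos->cached_index = -1;
    return NULL;
}

Whether the extra bookkeeping is worth it is exactly the open question above, given that the existing ss_oldest_flush_lsn optimization already avoids most traversals and the numbers here show no measurable regression.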
Re: CREATE TABLE creates a composite type corresponding to the table row, which is and is not there
I wrote: > The attached v2 is a simpler patch that instead modifies the existing > error message. Forgot to attach v2. -- Erik >From 9ab6b625e8f93ae2957144ec7f0ba32cf8a3bb5b Mon Sep 17 00:00:00 2001 From: Erik Wienhold Date: Fri, 8 Mar 2024 04:21:56 +0100 Subject: [PATCH v2] Document that typed tables rely on CREATE TYPE CREATE TABLE OF accepts only composite types that were created with CREATE TYPE. Clarify that also in the error message. --- doc/src/sgml/ref/create_table.sgml| 2 ++ src/backend/commands/tablecmds.c | 2 +- src/test/regress/expected/typed_table.out | 2 +- 3 files changed, 4 insertions(+), 2 deletions(-) diff --git a/doc/src/sgml/ref/create_table.sgml b/doc/src/sgml/ref/create_table.sgml index 4cbaaccaf7..889496cf0a 100644 --- a/doc/src/sgml/ref/create_table.sgml +++ b/doc/src/sgml/ref/create_table.sgml @@ -254,6 +254,8 @@ WITH ( MODULUS numeric_literal, REM schema-qualified). A typed table is tied to its type; for example the table will be dropped if the type is dropped (with DROP TYPE ... CASCADE). + Expects the composite type to have been created with + . diff --git a/src/backend/commands/tablecmds.c b/src/backend/commands/tablecmds.c index 7014be8039..0f43f349dc 100644 --- a/src/backend/commands/tablecmds.c +++ b/src/backend/commands/tablecmds.c @@ -6979,7 +6979,7 @@ check_of_type(HeapTuple typetuple) if (!typeOk) ereport(ERROR, (errcode(ERRCODE_WRONG_OBJECT_TYPE), -errmsg("type %s is not a composite type", +errmsg("type %s is not a composite type created with CREATE TYPE", format_type_be(typ->oid; } diff --git a/src/test/regress/expected/typed_table.out b/src/test/regress/expected/typed_table.out index 2e47ecbcf5..9edd744f6a 100644 --- a/src/test/regress/expected/typed_table.out +++ b/src/test/regress/expected/typed_table.out @@ -89,7 +89,7 @@ drop cascades to function get_all_persons() drop cascades to table persons2 drop cascades to table persons3 CREATE TABLE persons5 OF stuff; -- only CREATE TYPE AS types may be used -ERROR: type stuff is not a composite type +ERROR: type stuff is not a composite type created with CREATE TYPE DROP TABLE stuff; -- implicit casting CREATE TYPE person_type AS (id int, name text); -- 2.44.0
Re: ALTER TABLE SET ACCESS METHOD on partitioned tables
On Thu, Mar 07, 2024 at 08:02:00PM -0600, Justin Pryzby wrote: > As you wrote it, you pass the "defaultAccessMethod" bool to > ATExecSetAccessMethodNoStorage(), which seems odd. Why not just pass > the target amoid as newAccessMethod ? + /* +* Check that the table access method exists. +* Use the access method, if specified, otherwise (when not specified) 0 +* for partitioned tables or the configured default AM. +*/ + if (amname != NULL) + amoid = get_table_am_oid(amname, false); + else if (rel->rd_rel->relkind == RELKIND_PARTITIONED_TABLE) + amoid = 0; + else + amoid = get_table_am_oid(default_table_access_method, false); .. While using this flow to choose the AM oid, that's neater than the versions I could come up with, pretty cool. > When I fooled with this on my side, I called it "chgAccessMethod" to > follow "chgPersistence". I think "is default" isn't the right data > structure. > > Attached a relative patch with my version. Thanks. I've applied the patch to add the DEFAULT clause for now, to ease the work on this patch. > Also: I just realized that instead of adding a bool, we could test > (tab->rewrite & AT_REWRITE_ACCESS_METHOD) != 0 Hmm. I've considered that, but the boolean style is able to do the work, while being more consistent, so I'm OK with what you are doing in your 0002. > +-- Default and AM set in in clause are the same, relam should be set. > > in in? Oops, fixed. I have spent more time reviewing the whole and the tests (I didn't see much value in testing the DEFAULT clause twice for the partitioned table case and there is a test in d61a6cad6418), tweaked a few comments and the documentation, did an indentation and a commit message draft. How does that look to you? The test coverage and the semantics do what we want them to do, so that looks rather reasonable here. A second or even third pair of eyes would not hurt. -- Michael From 4114c7944c78dbb6eb85320434d942daa77fa4d8 Mon Sep 17 00:00:00 2001 From: Michael Paquier Date: Fri, 8 Mar 2024 13:24:14 +0900 Subject: [PATCH v4] Allow specifying access method for partitioned tables Partitioned tables gain support for CREATE TABLE .. USING, where specifying an access method can serve as a hint when creating partitions on it. This includes support for ALTER TABLE .. SET ACCESS METHOD, where specifying DEFAULT is equivalent to resetting the partitioned table's relam to 0. Specifying a non-DEFAULT access method will set its relam to not be 0. The historical default of using pg_class.relam as InvalidOid is preserved, corresponding to the case where USING clause is not used. In this case, like previous releases, a partition is created with the access method set by default_table_access_method if its CREATE TABLE query does not specify an access method with a USING clause. The relcache of partitioned tables is not changed: rd_tableam is not set, even if a partitioned table has a relam set. Authors: Justin Pryzby, Soumyadeep Chakraborty Reviewed-by: Michael Paquier, and a few others.. 
Discussion: https://postgr.es/m/ --- src/include/catalog/pg_class.h | 4 +- src/include/utils/lsyscache.h | 1 + src/backend/commands/tablecmds.c| 172 src/backend/utils/cache/lsyscache.c | 22 +++ src/backend/utils/cache/relcache.c | 7 + src/bin/pg_dump/pg_dump.c | 3 +- src/bin/pg_dump/t/002_pg_dump.pl| 35 + src/test/regress/expected/create_am.out | 158 +- src/test/regress/sql/create_am.sql | 91 - doc/src/sgml/catalogs.sgml | 5 +- doc/src/sgml/ref/alter_table.sgml | 8 ++ doc/src/sgml/ref/create_table.sgml | 4 + 12 files changed, 471 insertions(+), 39 deletions(-) diff --git a/src/include/catalog/pg_class.h b/src/include/catalog/pg_class.h index 3b7533e7bb..0fc2c093b0 100644 --- a/src/include/catalog/pg_class.h +++ b/src/include/catalog/pg_class.h @@ -219,7 +219,9 @@ MAKE_SYSCACHE(RELNAMENSP, pg_class_relname_nsp_index, 128); /* * Relation kinds with a table access method (rd_tableam). Although sequences * use the heap table AM, they are enough of a special case in most uses that - * they are not included here. + * they are not included here. Likewise, partitioned tables can have an access + * method defined so that their partitions can inherit it, but they do not set + * rd_tableam; hence, this is handled specially outside of this macro. */ #define RELKIND_HAS_TABLE_AM(relkind) \ ((relkind) == RELKIND_RELATION || \ diff --git a/src/include/utils/lsyscache.h b/src/include/utils/lsyscache.h index e4a200b00e..35a8dec2b9 100644 --- a/src/include/utils/lsyscache.h +++ b/src/include/utils/lsyscache.h @@ -139,6 +139,7 @@ extern char get_rel_relkind(Oid relid); extern bool get_rel_relispartition(Oid relid); extern Oid get_rel_tablespace(Oid relid); extern char get_re
Re: Patch: Add parse_type Function
Hi David, On 2024-03-08 02:37 +0100, David E. Wheeler wrote: > I’ve rebased the patch and, in an attempt to clarify this behavior, > added a couple of examples to the docs for to_regtype. Updated patch > attached. On your latest addition: > +). Failure to extract a valid potential > +type name results in an error. For example: > + > +SELECT to_regtype('party'); > + to_regtype > + > + > + > +However, if the extracted name is not known to the system, this > function > +will return NULL. For example: > + > +SELECT to_regtype('interval nonesuch'); > +ERROR: syntax error at or near "nonesuch" > +LINE 1: select to_regtype('interval nonesuch'); > + ^ > +CONTEXT: invalid type name "interval nonesuch" > + I think you need to swap the examples. The text mentions the error case first and the NULL case second. -- Erik
Re: Synchronizing slots from primary to standby
On Fri, Mar 8, 2024 at 9:56 AM Ajin Cherian wrote: > >> Pushed with minor modifications. I'll keep an eye on BF. >> >> BTW, one thing that we should try to evaluate a bit more is the >> traversal of slots in StandbySlotsHaveCaughtup() where we verify if >> all the slots mentioned in standby_slot_names have received the >> required WAL. Even if the standby_slot_names list is short the total >> number of slots can be much larger which can lead to an increase in >> CPU usage during traversal. There is an optimization that allows to >> cache ss_oldest_flush_lsn and ensures that we don't need to traverse >> the slots each time so it may not hit frequently but still there is a >> chance. I see it is possible to further optimize this area by caching >> the position of each slot mentioned in standby_slot_names in >> replication_slots array but not sure whether it is worth. >> >> > > I tried to test this by configuring a large number of logical slots while > making sure the standby slots are at the end of the array and checking if > there was any performance hit in logical replication from these searches. > Thanks Ajin and Nisha. We also plan: 1) Redoing XLogSendLogical time-log related test with 'sync_replication_slots' enabled. 2) pg_recvlogical test to monitor lag in StandbySlotsHaveCaughtup() for a large number of slots. 3) Profiling to see if StandbySlotsHaveCaughtup() is noticeable in the report when there are a large number of slots to traverse. thanks Shveta
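For reference, the XLogSendLogical() time logging mentioned above (and used for Nisha's earlier numbers) could be collected with something as simple as the following. This is an illustrative instrumentation hack, not the actual test patch: the accumulator variable is hypothetical and the wrapper is assumed to live in walsender.c next to XLogSendLogical().

#include "postgres.h"
#include "portability/instr_time.h"

/* Hypothetical accumulator for the total time spent in XLogSendLogical(). */
static instr_time total_logical_send_time;

static void
timed_XLogSendLogical(void)
{
    instr_time  start,
                elapsed;

    INSTR_TIME_SET_CURRENT(start);
    XLogSendLogical();          /* the walsender callback being measured */
    INSTR_TIME_SET_CURRENT(elapsed);

    INSTR_TIME_SUBTRACT(elapsed, start);
    INSTR_TIME_ADD(total_logical_send_time, elapsed);

    /* The real test logged the total once at the end of the run. */
    elog(DEBUG1, "XLogSendLogical cumulative time: %.3f ms",
         INSTR_TIME_GET_MILLISEC(total_logical_send_time));
}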
Re: Missing LWLock protection in pgstat_reset_replslot()
On Thu, Mar 07, 2024 at 11:30:55AM +0530, shveta malik wrote:
> It slightly improves the chances. pgstat_fetch_replslot is only
> called from pg_stat_get_replication_slot(), I thought it might be
> better to acquire lock before we call pgstat_fetch_replslot and
> release once we are done copying the results for that particular slot.
> But I also understand that it will not prevent someone from dropping
> that slot at a later stage when the rest of the calls of
> pg_stat_get_replication_slot() are still pending.

I doubt that there will be more callers of pgstat_fetch_replslot() in the
near future, but at least we would be a bit safer with these internal IDs
when manipulating the slots, when considered in isolation of this API call.

> So I am okay with the current one.

Okay, noted. Let's give a couple of days to others, in case there are more
comments. (Patch looked OK here after a second look this afternoon.)
--
Michael
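For what it's worth, the caller-side approach shveta describes above would look roughly like this. Treat it as a sketch only: the exact pgstat_fetch_replslot() signature and the choice of ReplicationSlotControlLock as the guarding lock are my assumptions, not the committed design.

#include "postgres.h"
#include "pgstat.h"
#include "replication/slot.h"
#include "storage/lwlock.h"

/*
 * Copy the stats entry for 'slotname' while holding the lock that keeps
 * the slot from being dropped concurrently.  Returns false if there is no
 * entry for that slot.
 */
static bool
fetch_replslot_stats(NameData slotname, PgStat_StatReplSlotEntry *out)
{
    PgStat_StatReplSlotEntry *entry;
    bool        found = false;

    LWLockAcquire(ReplicationSlotControlLock, LW_SHARED);

    entry = pgstat_fetch_replslot(slotname);
    if (entry != NULL)
    {
        *out = *entry;          /* copy out while still protected */
        found = true;
    }

    LWLockRelease(ReplicationSlotControlLock);

    return found;
}

As noted above, this only protects a single fetch; a later call in the same pg_stat_get_replication_slot() scan could still observe the slot being dropped.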