Re: Custom reloptions and lock modes
On Fri, Sep 20, 2019 at 11:59:13AM +0530, Kuntal Ghosh wrote: > On Fri, Sep 20, 2019 at 7:08 AM Michael Paquier wrote: >> Hence attached is a patch set to address those issues: >> - 0001 makes sure that any existing module creating a custom reloption >> has the lock mode set to AccessExclusiveLock, which would be a sane >> default anyway. I think that we should just back-patch that and close >> any holes. > > Looks good to me. The patch solves the issue and passes the > regression tests. IMHO, it should be back-patched to all the branches. That's the plan but... >> - 0002 is a patch which we could use to extend the existing reloption >> APIs to set the lock mode used. I am aware of the recent work done by >> Nikolay in CC to rework this API set, but I am unsure where we are >> going there, and the resulting patch is actually very short, being >> 20 lines long with the current infrastructure. That could go into >> HEAD. Table AMs have been added in v12 so custom reloptions could >> gain more in popularity, but as we are very close to the release it >> would not be cool to break those APIs. The patch's simplicity could >> also be a sufficient reason for a back-patch, and I don't think that >> there are many users of them yet. >> > > I think this is a good approach for now and can be committed on > HEAD only. Let's wait a couple of days to see if others have any objections to offer on the matter. My plan would be to revisit this patch set after RC1 is tagged next week to at least fix the bug. I don't predict any strong objections to the patch for HEAD, but who knows... > One small thing: > > add_int_reloption(bl_relopt_kind, buf, >"Number of bits generated for each index column", > - DEFAULT_BLOOM_BITS, 1, MAX_BLOOM_BITS); > + DEFAULT_BLOOM_BITS, 1, MAX_BLOOM_BITS, > + AccessExclusiveLock); > Do we need a comment to explain why we're using AccessExclusiveLock in > this case? Because that's the safest default to use here? That seemed obvious to me. -- Michael signature.asc Description: PGP signature
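For reference, here is a rough sketch of what a module registering a custom integer reloption could look like if the 0002 patch goes in. The option name and the MY_* constants are placeholders, and the trailing lock-mode argument to add_int_reloption() exists only in the proposed patch, not in any released API:

#include "postgres.h"
#include "fmgr.h"
#include "access/reloptions.h"
#include "storage/lockdefs.h"

PG_MODULE_MAGIC;

/* placeholder limits, not bloom's real constants */
#define MY_DEFAULT_BITS 2
#define MY_MAX_BITS 4096

static relopt_kind my_relopt_kind;

void _PG_init(void);

void
_PG_init(void)
{
	my_relopt_kind = add_reloption_kind();

	/*
	 * AccessExclusiveLock is the safest choice: altering the option then
	 * always takes the strongest lock on the relation.
	 */
	add_int_reloption(my_relopt_kind, "signature_bits",
					  "Number of bits generated for each index column",
					  MY_DEFAULT_BITS, 1, MY_MAX_BITS,
					  AccessExclusiveLock);
}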
Re: Custom reloptions and lock modes
On Fri, Sep 20, 2019 at 12:38 PM Michael Paquier wrote: > > > One small thing: > > > > add_int_reloption(bl_relopt_kind, buf, > >"Number of bits generated for each index column", > > - DEFAULT_BLOOM_BITS, 1, MAX_BLOOM_BITS); > > + DEFAULT_BLOOM_BITS, 1, MAX_BLOOM_BITS, > > + AccessExclusiveLock); > > Do we need a comment to explain why we're using AccessExclusiveLock in > > this case? > > Because that's the safest default to use here? That seemed obvious to > me. Okay. Sounds good. -- Thanks & Regards, Kuntal Ghosh EnterpriseDB: http://www.enterprisedb.com
Re: backup manifests
On Thu, Sep 19, 2019 at 11:10:46PM -0400, David Steele wrote: > On 9/19/19 11:00 AM, Robert Haas wrote: >> On Thu, Sep 19, 2019 at 9:51 AM Robert Haas wrote: >> > I intend that we should be able to support incremental backups based >> > either on a previous full backup or based on a previous incremental >> > backup. I am not aware of a technical reason why we need to identify >> > the specific backup that must be used. If incremental backup B is >> > taken based on a pre-existing backup A, then I think that B can be >> > restored using either A or *any other backup taken after A and before >> > B*. In the normal case, there probably wouldn't be any such backup, >> > but AFAICS the start-LSNs are a sufficient cross-check that the chosen >> > base backup is legal. >> >> Scratch that: there can be overlapping backups, so you have to >> cross-check both start and stop LSNs. > > Overall we have found it's much simpler to label each backup and cross-check > that against the pg version and system id. Start LSN is pretty unique, but > backup labels work really well and are more widely understood. Warning. The start LSN could be the same for multiple backups when taken from a standby. -- Michael signature.asc Description: PGP signature
Re: Two pg_rewind patches (auto generate recovery conf and ensure clean shutdown)
The patch series failed on Windows CI. I modified the Windows build file to fix that. See attached for the v7 version. v7-0003-Ensure-target-clean-shutdown-at-the-beginning-of-.patch Description: Binary data v7-0001-Extact-common-recovery-related-functions-from-pg_.patch Description: Binary data v7-0002-Add-option-to-write-recovery-configuration-inform.patch Description: Binary data
Re: psql - add SHOW_ALL_RESULTS option
Should we add function header for the below function to maintain the common standard of this file: Yes. Attached v6 does that. -- Fabien.diff --git a/contrib/pg_stat_statements/expected/pg_stat_statements.out b/contrib/pg_stat_statements/expected/pg_stat_statements.out index 6787ec1efd..de59a5cf65 100644 --- a/contrib/pg_stat_statements/expected/pg_stat_statements.out +++ b/contrib/pg_stat_statements/expected/pg_stat_statements.out @@ -49,17 +49,42 @@ BEGIN \; SELECT 2.0 AS "float" \; SELECT 'world' AS "text" \; COMMIT; + float +--- + 2.0 +(1 row) + + text +--- + world +(1 row) + -- compound with empty statements and spurious leading spacing -\;\; SELECT 3 + 3 \;\;\; SELECT ' ' || ' !' \;\; SELECT 1 + 4 \;; - ?column? --- -5 +\;\; SELECT 3 + 3 AS "+" \;\;\; SELECT ' ' || ' !' AS "||" \;\; SELECT 1 + 4 AS "+" \;; + + +--- + 6 +(1 row) + + || +- + ! +(1 row) + + + +--- + 5 (1 row) -- non ;-terminated statements SELECT 1 + 1 + 1 AS "add" \gset SELECT :add + 1 + 1 AS "add" \; SELECT :add + 1 + 1 AS "add" \gset + add +- + 5 +(1 row) + -- set operator SELECT 1 AS i UNION SELECT 2 ORDER BY i; i @@ -102,12 +127,12 @@ SELECT query, calls, rows FROM pg_stat_statements ORDER BY query COLLATE "C"; SELECT $1+| 4 |4 +| | AS "text" | | - SELECT $1 + $2| 2 |2 SELECT $1 + $2 + $3 AS "add" | 3 |3 + SELECT $1 + $2 AS "+" | 2 |2 SELECT $1 AS "float" | 1 |1 SELECT $1 AS "int"| 2 |2 SELECT $1 AS i UNION SELECT $2 ORDER BY i | 1 |2 - SELECT $1 || $2 | 1 |1 + SELECT $1 || $2 AS "||" | 1 |1 SELECT pg_stat_statements_reset() | 1 |1 WITH t(f) AS ( +| 1 |2 VALUES ($1), ($2) +| | diff --git a/contrib/pg_stat_statements/sql/pg_stat_statements.sql b/contrib/pg_stat_statements/sql/pg_stat_statements.sql index 8b527070d4..ea3a31176e 100644 --- a/contrib/pg_stat_statements/sql/pg_stat_statements.sql +++ b/contrib/pg_stat_statements/sql/pg_stat_statements.sql @@ -27,7 +27,7 @@ SELECT 'world' AS "text" \; COMMIT; -- compound with empty statements and spurious leading spacing -\;\; SELECT 3 + 3 \;\;\; SELECT ' ' || ' !' \;\; SELECT 1 + 4 \;; +\;\; SELECT 3 + 3 AS "+" \;\;\; SELECT ' ' || ' !' AS "||" \;\; SELECT 1 + 4 AS "+" \;; -- non ;-terminated statements SELECT 1 + 1 + 1 AS "add" \gset diff --git a/doc/src/sgml/ref/psql-ref.sgml b/doc/src/sgml/ref/psql-ref.sgml index 7789fc6177..4e6ab5a0a5 100644 --- a/doc/src/sgml/ref/psql-ref.sgml +++ b/doc/src/sgml/ref/psql-ref.sgml @@ -127,18 +127,11 @@ echo '\x \\ SELECT * FROM foo;' | psql commands included in the string to divide it into multiple transactions. (See for more details about how the server handles multi-query strings.) - Also, psql only prints the - result of the last SQL command in the string. - This is different from the behavior when the same string is read from - a file or fed to psql's standard input, - because then psql sends - each SQL command separately. - Because of this behavior, putting more than one SQL command in a - single -c string often has unexpected results. - It's better to use repeated -c commands or feed - multiple commands to psql's standard input, + If having several commands executed in one transaction is not desired, + use repeated -c commands or feed multiple commands to + psql's standard input, either using echo as illustrated above, or via a shell here-document, for example: @@ -3355,10 +3348,8 @@ select 1\; select 2\; select 3; commands included in the string to divide it into multiple transactions. (See for more details about how the server handles multi-query strings.) 
-psql prints only the last query result -it receives for each request; in this example, although all -three SELECTs are indeed executed, psql -only prints the 3. +psql prints all results it receives, one +after the other. diff --git a/src/bin/psql/common.c b/src/bin/psql/common.c index 4b2679360f..db6d031133 100644 --- a/src/bin/psql/common.c +++ b/src/bin/psql/common.c @@ -486,6 +486,19 @@ ResetCancelConn(void) #endif } +/* + * Show error message from result, if any, and check connection in passing. + */ +static void +ShowErrorMessage(const PGresult *result) +{ + const char *error = PQerrorMessage(pset.d
Re: Allow CLUSTER, VACUUM FULL and REINDEX to change tablespace on the fly
On 20/9/19 4:06, Michael Paquier wrote: On Thu, Sep 19, 2019 at 05:40:41PM +0300, Alexey Kondratov wrote: On 19.09.2019 16:21, Robert Haas wrote: So, earlier in this thread, I suggested making this part of ALTER TABLE, and several people seemed to like that idea. Did we have a reason for dropping that approach? Personally, I don't find this idea very attractive as ALTER TABLE is already complicated enough with all the subqueries we already support in the command, all the logic we need to maintain to make combinations of those subqueries in a minimum number of steps, and also the number of bugs we have seen because of the amount of complication present. Yes, but please keep the other options: As it is, cluster, vacuum full and reindex already rewrite the table in full; Being able to write the result to a different tablespace than the original object was stored in enables a whole world of very interesting possibilities including a quick way out of a "so little disk space available that vacuum won't work properly" situation --- which I'm sure MANY users will appreciate, including me. If we add this option to REINDEX, then for 'ALTER TABLE tb_name action1, REINDEX SET TABLESPACE tbsp_name, action3' action2 will be just a direct alias to 'REINDEX TABLE tb_name SET TABLESPACE tbsp_name'. So it seems practical to do this for REINDEX first. The only concern I have against adding REINDEX to ALTER TABLE in this context is that it will allow users to write such a chimera: ALTER TABLE tb_name REINDEX SET TABLESPACE tbsp_name, SET TABLESPACE tbsp_name; when they want to move both the table and all the indexes. Because a simple ALTER TABLE tb_name REINDEX, SET TABLESPACE tbsp_name; looks ambiguous. Should it change the tablespace of the table, the indexes, or both? Indeed. IMHO, that form of the command should not allow that much flexibility... even on the "principle of least surprise" grounds :S That is, I'd restrict the ability to change (output) tablespace to the "direct" form --- REINDEX name, VACUUM (FULL) name, CLUSTER name --- whereas the ALTER table|index SET TABLESPACE would continue to work. Now that I come to think of it, maybe saying "output" or "move to" rather than "set tablespace" would make more sense for this variation of the commands? (clearer, less prone to confusion)? Tricky question, but we don't change the tablespace of indexes when using an ALTER TABLE, so I would say no on compatibility grounds. ALTER TABLE has never touched the tablespace of indexes, and I don't think that we should begin to do so. Indeed. I might be missing something, but is there any reason to not *require* an explicit transaction for the above multi-action commands? I mean, have it be: BEGIN; ALTER TABLE tb_name SET TABLESPACE tbsp_name; -- moves the table but possibly NOT the indexes? ALTER TABLE tb_name REINDEX [OUTPUT TABLESPACE tbsp_name]; -- REINDEX, placing the resulting index on tbsp_name instead of the original one COMMIT; ... and have the parser/planner combine the steps if it'd make sense (it probably wouldn't in this example)? Just my .02€ Thanks, / J.L.
Re: A problem about partitionwise join
Hi Dilip, Thank you for reviewing this patch. On Fri, Sep 20, 2019 at 12:48 PM Dilip Kumar wrote: > On Thu, Aug 29, 2019 at 3:15 PM Richard Guo wrote: > > > > > > Attached is a patch as an attempt to address this issue. The idea is > > quite straightforward. When building partition info for joinrel, we > > generate any possible EC-derived joinclauses of form 'outer_em = > > inner_em', which will be used together with the original restrictlist to > > check if there exists an equi-join condition for each pair of partition > > keys. > > > > Any comments are welcome! > /* > + * generate_join_implied_equalities_for_all > + * Create any EC-derived joinclauses of form 'outer_em = inner_em'. > + * > + * This is used when building partition info for joinrel. > + */ > +List * > +generate_join_implied_equalities_for_all(PlannerInfo *root, > + Relids join_relids, > + Relids outer_relids, > + Relids inner_relids) > > I think we need to have more detailed comments about why we need this > separate function, we can also explain that > generate_join_implied_equalities function will avoid > the join clause if EC has the constant but for partition-wise join, we > need that clause too. > Thank you for the suggestion. > > + while ((i = bms_next_member(matching_ecs, i)) >= 0) > + { > + EquivalenceClass *ec = (EquivalenceClass *) list_nth(root->eq_classes, > i); > + List*outer_members = NIL; > + List*inner_members = NIL; > + ListCell *lc1; > + > + /* Do not consider this EC if it's ec_broken */ > + if (ec->ec_broken) > + continue; > + > + /* Single-member ECs won't generate any deductions */ > + if (list_length(ec->ec_members) <= 1) > + continue; > + > > I am wondering isn't it possible to just process the missing join > clause? I mean 'generate_join_implied_equalities' has only skipped > the ECs which has const so > can't we create join clause only for those ECs and append it the > "Restrictlist" we already have? I might be missing something? > For ECs without const, 'generate_join_implied_equalities' also has skipped some join clauses since it only selects the joinclause with 'best_score' between outer members and inner members. And the missing join clauses are needed to generate partitionwise join. That's why the query below cannot be planned as partitionwise join, as we have missed joinclause 'foo.k = bar.k'. select * from p as foo join p as bar on foo.k = bar.val and foo.k = bar.k; And yes 'generate_join_implied_equalities_for_all' will create join clauses that have existed in restrictlist. I think it's OK since the same RestrictInfo deduced from EC will share the same pointer and list_concat_unique_ptr will make sure there are no duplicates in the restrictlist used by have_partkey_equi_join. Thanks Richard
Re: psql - add SHOW_ALL_RESULTS option
On Fri, Sep 20, 2019 at 1:41 PM Fabien COELHO wrote: > > > > Should we add function header for the below function to maintain the > > common standard of this file: > > Yes. Attached v6 does that. > Thanks for fixing it. The below addition can be removed, it seems to be duplicate: @@ -3734,6 +3734,11 @@ listTables(const char *tabtypes, const char *pattern, bool verbose, bool showSys translate_columns[cols_so_far] = true; } + /* + * We don't bother to count cols_so_far below here, as there's no need + * to; this might change with future additions to the output columns. + */ + /* * We don't bother to count cols_so_far below here, as there's no need * to; this might change with future additions to the output columns. Regards, Vignesh EnterpriseDB: http://www.enterprisedb.com
Re: d25ea01275 and partitionwise join
On Thu, Sep 19, 2019 at 4:15 PM Amit Langote wrote: > Hi Richard, > > Thanks a lot for taking a close look at the patch and sorry about the > delay. > > On Wed, Sep 4, 2019 at 5:29 PM Richard Guo wrote: > >> On Wed, Sep 4, 2019 at 10:01 AM Amit Langote > wrote: > > I'm reviewing v2-0002 and I have concern about how COALESCE expr is > > processed in match_join_arg_to_partition_keys(). > > > > If there is a COALESCE expr with first arg being non-partition key expr > > and second arg being partition key, the patch would match it to the > > partition key, which may result in wrong results in some cases. > > > > For instance, consider the partition table below: > > > > create table p (k int, val int) partition by range(k); > > create table p_1 partition of p for values from (1) to (10); > > create table p_2 partition of p for values from (10) to (100); > > > > So with patch v2-0002, the following query will be planned with > > partitionwise join. > > > > # explain (costs off) > > select * from (p as t1 full join p as t2 on t1.k = t2.k) as > t12(k1,val1,k2,val2) > > full join p as t3 on COALESCE(t12.val1, > t12.k1) = t3.k; > > QUERY PLAN > > -- > > Append > >-> Hash Full Join > > Hash Cond: (COALESCE(t1.val, t1.k) = t3.k) > > -> Hash Full Join > >Hash Cond: (t1.k = t2.k) > >-> Seq Scan on p_1 t1 > >-> Hash > > -> Seq Scan on p_1 t2 > > -> Hash > >-> Seq Scan on p_1 t3 > >-> Hash Full Join > > Hash Cond: (COALESCE(t1_1.val, t1_1.k) = t3_1.k) > > -> Hash Full Join > >Hash Cond: (t1_1.k = t2_1.k) > >-> Seq Scan on p_2 t1_1 > >-> Hash > > -> Seq Scan on p_2 t2_1 > > -> Hash > >-> Seq Scan on p_2 t3_1 > > (19 rows) > > > > But as t1.val is not a partition key, actually we cannot use > > partitionwise join here. > > > > If we insert below data into the table, we will get wrong results for > > the query above. > > > > insert into p select 5,15; > > insert into p select 15,5; > > Good catch! It's quite wrong to use COALESCE(t12.val1, t12.k1) = t3.k > for partitionwise join as the COALESCE expression might as well output > the value of val1 which doesn't conform to partitioning. > > I've fixed match_join_arg_to_partition_keys() to catch that case and > fail. Added a test case as well. > > Please find attached updated patches. > Thank you for the fix. Will review. Thanks Richard
Re: Avoiding possible future conformance headaches in JSON work
On 2019-09-20 01:14, Tom Lane wrote: > Chapman Flack writes: >> Sure, against *every* non-spec feature we have or ever will have, someone >> /could/ raise a generic "what if SQL committee might add something pretty >> similar in future". >> But what we have in this case are specific non-spec features (array, map, >> and sequence constructors, lambdas, map/fold/reduce, user-defined >> functions) that are flat-out already present in the current version of >> the language that the SQL committee is clearly modeling jsonpath on. > > Sure. But we also modeled those features on the same language that the > committee is looking at (or at least I sure hope we did). So it's > reasonable to assume that they would come out at the same spot without > any prompting. And we can prompt them ;-). Also, I understand these are features proposed for PG13, not in PG12. So while this is an important discussion, it's not relevant to the PG12 release, right? (If so, I'm content to close these open items.) -- Peter Eisentraut http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Re: pgbench - allow to create partitioned tables
On Fri, Sep 20, 2019 at 10:29 AM Fabien COELHO wrote: > > >> The behavior is not actually changed, but I had to move fillfactor away > >> because it cannot be declared on partitioned tables, it must be declared > >> on partitions only. Once there is a function to handle that it is pretty > >> easy to add the test. > >> > >> I can remove it but frankly there are only benefits: the default is now > >> tested by pgbench, the create query is smaller, and it would work with > >> older versions of pg, which does not matter but is good on principle. > > > > I am not saying that it is a bad check on its own, rather it might be > > good, but let's not do any unrelated change as that will delay the > > main patch. Once we are done with the main patch, you can propose > > these as improvements. > > I would not bother to create a patch for so small an improvement. This > makes sense in passing because the created function makes it very easy, > but otherwise I'll just drop it. > I would prefer to drop it for now. > >> The user could do a -i with a version of pgbench and bench with another > >> one. I do that often while developing… > > > > I am not following what you want to say here especially ("pgbench and > > bench with another one"). Can you explain with some example? > > While developing, I often run pgbench under development client against an > already created set of tables on an already created cluster, and usually > the server side on my laptop is the last major release from pgdg (ie 11.5) > while the pgbench I'm testing is from sources (ie 12dev). If I type > "pgbench" I run 11.5, and in the sources "./pgbench" runs the dev version. > Hmm, I think some such thing is possible when you are running a lower version of pgbench with tables initialized by a higher version of pgbench. Because a higher version of pgbench must be a superset of a lower version unless we drop support for one of the partitioning methods. I think that even if there is some unknown partition method, it should be detected much earlier rather than at the stage of printing the results, as in the below code after the query for partitions. + else + { + fprintf(stderr, "unexpected partition method: \"%s\"\n", ps); + exit(1); + } If we can't catch that earlier, then it might be better to have some version-specific checks rather than such obscure code which is difficult to understand for others. I have made a few modifications in the attached patch. * move the create partitions related code into a separate function. * make the check related to the number of partitions consistent, i.e., check partitions > 0 everywhere apart from where we print it, which I also want to change, but let us first discuss one of the above points * when we don't find the pgbench_accounts table, error out instead of continuing * ensure append_fillfactor doesn't assume that it has to append fillfactor, and remove the fillfactor < 100 check from it. * improve the comments around the query to fetch partitions * improve the comments in the patch and make the code look like nearby code * pgindent the patch I think we should try to add some note or comment explaining why we only choose to partition the pgbench_accounts table when the user has given the --partitions option. -- With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com pgbench-init-partitioned-13.patch Description: Binary data
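To illustrate the kind of version-specific check mentioned above, here is a rough sketch (not the attached patch) of skipping the partition lookup on servers that predate declarative partitioning. The partition_method_t enum and the query text are hypothetical stand-ins; PQserverVersion() is the real libpq call:

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include "libpq-fe.h"

typedef enum { PART_NONE, PART_RANGE, PART_HASH } partition_method_t;

static partition_method_t
lookup_partition_method(PGconn *con)
{
	PGresult   *res;
	partition_method_t method = PART_NONE;

	/* pg_partitioned_table only exists from PostgreSQL 10 onwards */
	if (PQserverVersion(con) < 100000)
		return PART_NONE;

	res = PQexec(con,
				 "SELECT p.partstrat FROM pg_catalog.pg_partitioned_table p "
				 "JOIN pg_catalog.pg_class c ON c.oid = p.partrelid "
				 "WHERE c.relname = 'pgbench_accounts'");
	if (PQresultStatus(res) != PGRES_TUPLES_OK)
	{
		fprintf(stderr, "could not query partition information: %s",
				PQerrorMessage(con));
		exit(1);
	}

	if (PQntuples(res) == 1)
	{
		const char *ps = PQgetvalue(res, 0, 0);

		if (strcmp(ps, "r") == 0)
			method = PART_RANGE;
		else if (strcmp(ps, "h") == 0)
			method = PART_HASH;
		else
		{
			/* caught here, before any results are printed */
			fprintf(stderr, "unexpected partition method: \"%s\"\n", ps);
			exit(1);
		}
	}

	PQclear(res);
	return method;
}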
Re: [WIP] UNNEST(REFCURSOR): allowing SELECT to consume data from a REFCURSOR
Hi John, Thanks for pushing this; to me it looks like a promising start! I need a bit more time to go through the code (and I'm not an expert in Postgres internals in any way), but I really appreciate you doing this. Roman
Re: [PATCH] src/test/modules/dummy_index -- way to test reloptions from inside of access method
On Fri, Sep 20, 2019 at 09:16:58AM +0900, Michael Paquier wrote: > On Thu, Sep 19, 2019 at 02:13:23PM +0300, Nikolay Shaplov wrote: >> What a good catch! dummy_index already proved to be useful ;-) > > Yes, the testing around custom reloptions is rather poor now, so this > module has value. I still don't like much its shape though, so I > began hacking on it for integration, and I wanted to get that part > done in this CF :) So... I have looked at the patch of upthread in details, and as I suspected the module is over-designed. First, on HEAD the coverage of reloptions.c is 86.6%, with your patch we get at 94.1%, and with the attached I reach 95.1% thanks to the addition of a string parameter with a NULL default value and a NULL description, for roughly half the code size. The GUCs are also basically not necessary, as you can just replace the various WARNING calls (please don't call elog on anything which can be reached by the user!) by lookups at reloptions in pg_class. Once this is removed, the whole code gets more simple, and there is no point in having neither a separate header nor a different set of files and the size of the whole module gets really reduced. I still need to do an extra pass on the code (particularly the AM part), but I think that we could commit that. Please note that I included the fix for the lockmode I sent today so as the patch can be tested: https://www.postgresql.org/message-id/20190920013831.gd1...@paquier.xyz Thoughts? -- Michael diff --git a/contrib/bloom/expected/bloom.out b/contrib/bloom/expected/bloom.out index 5ab9e34f82..dae12a7d3e 100644 --- a/contrib/bloom/expected/bloom.out +++ b/contrib/bloom/expected/bloom.out @@ -5,6 +5,7 @@ CREATE TABLE tst ( ); INSERT INTO tst SELECT i%10, substr(md5(i::text), 1, 1) FROM generate_series(1,2000) i; CREATE INDEX bloomidx ON tst USING bloom (i, t) WITH (col1 = 3); +ALTER INDEX bloomidx SET (length=80); SET enable_seqscan=on; SET enable_bitmapscan=off; SET enable_indexscan=off; diff --git a/contrib/bloom/sql/bloom.sql b/contrib/bloom/sql/bloom.sql index 32755f2b1a..4733e1e705 100644 --- a/contrib/bloom/sql/bloom.sql +++ b/contrib/bloom/sql/bloom.sql @@ -7,6 +7,7 @@ CREATE TABLE tst ( INSERT INTO tst SELECT i%10, substr(md5(i::text), 1, 1) FROM generate_series(1,2000) i; CREATE INDEX bloomidx ON tst USING bloom (i, t) WITH (col1 = 3); +ALTER INDEX bloomidx SET (length=80); SET enable_seqscan=on; SET enable_bitmapscan=off; diff --git a/src/backend/access/common/reloptions.c b/src/backend/access/common/reloptions.c index 20f4ed3c38..b59e606771 100644 --- a/src/backend/access/common/reloptions.c +++ b/src/backend/access/common/reloptions.c @@ -659,6 +659,13 @@ allocate_reloption(bits32 kinds, int type, const char *name, const char *desc) newoption->namelen = strlen(name); newoption->type = type; + /* + * Set the default lock mode for this option. There is no actual way + * for a module to enforce it when declaring a custom relation option, + * so just use the highest level, which is safe for all cases. 
+ */ + newoption->lockmode = AccessExclusiveLock; + MemoryContextSwitchTo(oldcxt); return newoption; diff --git a/src/test/modules/Makefile b/src/test/modules/Makefile index 0e4e53d63e..b2eaef3bff 100644 --- a/src/test/modules/Makefile +++ b/src/test/modules/Makefile @@ -7,6 +7,7 @@ include $(top_builddir)/src/Makefile.global SUBDIRS = \ brin \ commit_ts \ + dummy_index_am \ dummy_seclabel \ snapshot_too_old \ test_bloomfilter \ diff --git a/src/test/modules/dummy_index_am/.gitignore b/src/test/modules/dummy_index_am/.gitignore new file mode 100644 index 00..5dcb3ff972 --- /dev/null +++ b/src/test/modules/dummy_index_am/.gitignore @@ -0,0 +1,4 @@ +# Generated subdirectories +/log/ +/results/ +/tmp_check/ diff --git a/src/test/modules/dummy_index_am/Makefile b/src/test/modules/dummy_index_am/Makefile new file mode 100644 index 00..b7e09b1469 --- /dev/null +++ b/src/test/modules/dummy_index_am/Makefile @@ -0,0 +1,20 @@ +# src/test/modules/dummy_index_am/Makefile + +MODULES = dummy_index_am + +EXTENSION = dummy_index_am +DATA = dummy_index_am--1.0.sql +PGFILEDESC = "dummy_index_am - minimized index access method" + +REGRESS = reloptions + +ifdef USE_PGXS +PG_CONFIG = pg_config +PGXS := $(shell $(PG_CONFIG) --pgxs) +include $(PGXS) +else +subdir = src/test/modules/dummy_index_am +top_builddir = ../../../.. +include $(top_builddir)/src/Makefile.global +include $(top_srcdir)/contrib/contrib-global.mk +endif diff --git a/src/test/modules/dummy_index_am/README b/src/test/modules/dummy_index_am/README new file mode 100644 index 00..5563d060ac --- /dev/null +++ b/src/test/modules/dummy_index_am/README @@ -0,0 +1,11 @@ +Dummy Index AM +== + +Dummy index is an index module for testing any facility usable by an index +access method, whose code is kept a maximum simple. + +This includes tests for all relation option types: +- boolean +- integer +- r
Re: standby recovery fails (tablespace related) (tentative patch and discussion)
On Thu, Sep 19, 2019 at 5:29 PM Asim R P wrote: > > In order to fix the test failures, we need to distinguish between a missing database directory and a missing tablespace directory. We also need to add logic to forget missing directories during tablespace drop. I am working on it. Please find attached a solution that builds on what Paul has proposed. A hash table, similar to the invalid page hash table, is used to track missing directory references. A missing directory may be a tablespace or a database, based on whether the tablespace is found missing or the source database is found missing. The crash recovery succeeds if the hash table is empty at the end. Asim v6-0001-Support-node-initialization-from-backup-with-tabl.patch Description: Binary data v6-0002-Tests-to-replay-create-database-operation-on-stan.patch Description: Binary data v6-0003-Fix-replay-of-create-database-records-on-standby.patch Description: Binary data
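For illustration only, a minimal sketch of the bookkeeping described above, modeled on the invalid-page table in xlogutils.c. Every name here is hypothetical and not taken from the attached patches; the point is just the shape: remember a missing tablespace or database directory during redo, forget it when a later record drops it, and complain at the end of recovery only if entries remain:

#include "postgres.h"
#include "utils/hsearch.h"

typedef struct missing_dir_key
{
	Oid			spcNode;		/* tablespace */
	Oid			dbNode;			/* database, InvalidOid if whole tablespace */
} missing_dir_key;

typedef struct missing_dir_entry
{
	missing_dir_key key;
	bool		is_tablespace;	/* tablespace dir vs. database dir missing */
} missing_dir_entry;

static HTAB *missing_dir_tab = NULL;

static void
remember_missing_dir(Oid spcNode, Oid dbNode, bool is_tablespace)
{
	missing_dir_key key;
	missing_dir_entry *entry;
	bool		found;

	if (missing_dir_tab == NULL)
	{
		HASHCTL		ctl;

		memset(&ctl, 0, sizeof(ctl));
		ctl.keysize = sizeof(missing_dir_key);
		ctl.entrysize = sizeof(missing_dir_entry);
		missing_dir_tab = hash_create("missing directories", 16, &ctl,
									  HASH_ELEM | HASH_BLOBS);
	}

	key.spcNode = spcNode;
	key.dbNode = dbNode;
	entry = hash_search(missing_dir_tab, &key, HASH_ENTER, &found);
	if (!found)
		entry->is_tablespace = is_tablespace;
}

static void
forget_missing_dir(Oid spcNode, Oid dbNode)
{
	missing_dir_key key;

	key.spcNode = spcNode;
	key.dbNode = dbNode;
	if (missing_dir_tab != NULL)
		(void) hash_search(missing_dir_tab, &key, HASH_REMOVE, NULL);
}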
WAL recycled despite logical replication slot
While testing something else (whether "terminating walsender process due to replication timeout" was happening spuriously), I had logical replication set up streaming a default pgbench transaction load, with the publisher being 13devel-e1c8743 and subscriber being 12BETA4. Eventually I started getting errors about requested wal segments being already removed: 10863 sub idle 0 2019-09-19 17:14:58.140 EDT LOG: starting logical decoding for slot "sub" 10863 sub idle 0 2019-09-19 17:14:58.140 EDT DETAIL: Streaming transactions committing after 79/EB0B17A0, reading WAL from 79/E70736A0. 10863 sub idle 58P01 2019-09-19 17:14:58.140 EDT ERROR: requested WAL segment 0001007900E7 has already been removed 10863 sub idle 0 2019-09-19 17:14:58.144 EDT LOG: disconnection: session time: 0:00:00.030 user=jjanes database=jjanes host=10.0.2.2 port=40830 It had been streaming for about 50 minutes before the error showed up, and it showed right when streaming was restarting after one of the replication timeouts. Is there an innocent explanation for this? I thought logical replication slots provided an iron-clad guarantee that WAL would be retained until it was no longer needed. I am just using pub/sub, none of the lower level stuff. Cheers, Jeff
Re: backup manifests
On Thu, Sep 19, 2019 at 11:10 PM David Steele wrote: > Overall we have found it's much simpler to label each backup and > cross-check that against the pg version and system id. Start LSN is > pretty unique, but backup labels work really well and are more widely > understood. I see your point, but part of my point is that uniqueness is not a technical requirement. However, it may be a requirement for user comprehension. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: Conflict handling for COPY FROM
Hi Surafel, On 16.07.2019 10:08, Surafel Temesgen wrote: I also added an option to ignore all errors when ERROR is set to -1 Great! The patch still applies cleanly (tested on e1c8743e6c), but I've got some problems using more elaborate tests. First of all, there is definitely a problem with grammar. In the docs ERROR is defined as an option, and COPY test FROM '/path/to/copy-test-simple.csv' ERROR -1; works just fine, but if the modern 'WITH (...)' syntax is used: COPY test FROM '/path/to/copy-test-simple.csv' WITH (ERROR -1); ERROR: option "error" not recognized while with 'WITH (error_limit -1)' it works again. This happens because COPY supports both the modern and the very old syntax: * In the preferred syntax the options are comma-separated * and use generic identifiers instead of keywords. The pre-9.0 * syntax had a hard-wired, space-separated set of options. So I see several options here: 1) Everything is left as is, but then the docs should be updated to reflect that error_limit is required for the modern syntax. 2) However, why do we have to support the old syntax here? I guess it exists for backward compatibility only, but this is a completely new feature. So maybe just 'WITH (error_limit 42)' will be enough? 3) You may also simply change the internal option name from 'error_limit' to 'error', or the SQL keyword from 'ERROR' to 'ERROR_LIMIT'. I would prefer the second option. Next, you use DestRemoteSimple for returning conflicting tuples back: + dest = CreateDestReceiver(DestRemoteSimple); + dest->rStartup(dest, (int) CMD_SELECT, tupDesc); However, printsimple supports a very limited subset of built-in types, so CREATE TABLE large_test (id integer primary key, num1 bigint, num2 double precision); COPY large_test FROM '/path/to/copy-test.tsv'; COPY large_test FROM '/path/to/copy-test.tsv' ERROR 3; fails with the following error 'ERROR: unsupported type OID: 701', which seems to be very confusing from the end user perspective. I've tried to switch to DestRemote, but couldn't figure it out quickly. Finally, I simply cannot get into this validation: + else if (strcmp(defel->defname, "error_limit") == 0) + { + if (cstate->ignore_error) + ereport(ERROR, + (errcode(ERRCODE_SYNTAX_ERROR), + errmsg("conflicting or redundant options"), + parser_errposition(pstate, defel->location))); + cstate->error_limit = defGetInt64(defel); + cstate->ignore_error = true; + if (cstate->error_limit == -1) + cstate->ignore_all_error = true; + } If cstate->ignore_error is set, then we have already processed the options list, since this is the only place where it's set. So we should never get into this ereport, should we? Regards -- Alexey Kondratov Postgres Professional https://www.postgrespro.com Russian Postgres Company
Re: backup manifests
On Thu, Sep 19, 2019 at 11:06 PM David Steele wrote: > > I am not crazy about JSON because it requires that I get a json parser > > into src/common, which I could do, but given the possibly-imminent end > > of the universe, I'm not sure it's the greatest use of time. You're > > right that if we pick an ad-hoc format, we've got to worry about > > escaping, which isn't lovely. > > My experience is that JSON is simple to implement and has already dealt > with escaping and data structure considerations. A home-grown solution > will be at least as complex but have the disadvantage of being non-standard. I think that's fair and just spent a little while investigating how difficult it would be to disentangle the JSON parser from the backend. It has dependencies on the following bits of backend-only functionality: - check_stack_depth(). No problem, I think. Just skip it for frontend code. - pg_mblen() / GetDatabaseEncoding(). Not sure what to do about this. Some of our infrastructure for dealing with encoding is available in the frontend and backend, but this part is backend-only. - elog() / ereport(). Kind of a pain. We could just kill the program if an error occurs, but that seems a bit ham-fisted. Refactoring the code so that the error is returned rather than thrown might be the way to go, but it's not simple, because you're not just passing a string. ereport(ERROR, (errcode(ERRCODE_INVALID_TEXT_REPRESENTATION), errmsg("invalid input syntax for type %s", "json"), errdetail("Character with value 0x%02x must be escaped.", (unsigned char) *s), report_json_context(lex))); - appendStringInfo et. al. I don't think it would be that hard to move this to src/common, but I'm also not sure it really solves the problem, because StringInfo has a 1GB limit, and there's no rule at all that a backup manifest has got to be less than 1GB. https://www.pgcon.org/2013/schedule/events/595.en.html This gets at another problem that I just started to think about. If the file is just a series of lines, you can parse it one line and a time and do something with that line, then move on. If it's a JSON blob, you have to parse the whole file and get a potentially giant data structure back, and then operate on that data structure. At least, I think you do. There's probably some way to create a callback structure that lets you presuppose that the toplevel data structure is an array (or object) and get back each element of that array (or key/value pair) as it's parsed, but that sounds pretty annoying to get working. Or we could just decide that you have to have enough memory to hold the parsed version of the entire manifest file in memory all at once, and if you don't, maybe you should drop some tables or buy more RAM. That still leaves you with bypassing the 1GB size limit on StringInfo, maybe by having a "huge" option, or perhaps by memory-mapping the file and then making the StringInfo point directly into the mapped region. Perhaps I'm overthinking this and maybe you have a simpler idea in mind about how it can be made to work, but I find all this complexity pretty unappealing. Here's a competing proposal: let's decide that lines consist of tab-separated fields. If a field contains a \t, \r, or \n, put a " at the beginning, a " at the end, and double any " that appears in the middle. This is easy to generate and easy to parse. It lets us completely ignore encoding considerations. Incremental parsing is straightforward. 
Quoting will rarely be needed because there's very little reason to create a file inside a PostgreSQL data directory that contains a tab or a newline, but if you do it'll still work. The lack of quoting is nice for humans reading the manifest, and nice in terms of keeping the manifest succinct; in contrast, note that using JSON doubles every backslash. I hear you saying that this is going to end up being just as complex in the end, but I don't think I believe it. It sounds to me like the difference between spending a couple of hours figuring this out and spending a couple of months trying to figure it out and maybe not actually getting anywhere. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
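For concreteness, a minimal sketch of that quoting rule in plain C (not from any posted patch): a field is written verbatim unless it contains a tab, carriage return, or newline, in which case it is wrapped in double quotes with embedded quotes doubled:

#include <stdbool.h>
#include <stdio.h>
#include <string.h>

static bool
field_needs_quoting(const char *field)
{
	return strpbrk(field, "\t\r\n") != NULL;
}

static void
write_manifest_field(FILE *out, const char *field)
{
	if (!field_needs_quoting(field))
	{
		fputs(field, out);
		return;
	}

	fputc('"', out);
	for (const char *p = field; *p; p++)
	{
		if (*p == '"')
			fputc('"', out);	/* double embedded quotes */
		fputc(*p, out);
	}
	fputc('"', out);
}

int
main(void)
{
	/* one manifest line: path, size, checksum, tab-separated */
	write_manifest_field(stdout, "base/16384/weird\tname");
	fputc('\t', stdout);
	write_manifest_field(stdout, "8192");
	fputc('\t', stdout);
	write_manifest_field(stdout, "0e4fa1c3");
	fputc('\n', stdout);
	return 0;
}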
Re: psql - add SHOW_ALL_RESULTS option
The below addition can be removed, it seems to be duplicate: Indeed. I'm unsure how this got into the patch, probably some rebase mix-up. Attached v7 removes the duplicates. -- Fabien.diff --git a/contrib/pg_stat_statements/expected/pg_stat_statements.out b/contrib/pg_stat_statements/expected/pg_stat_statements.out index 6787ec1efd..de59a5cf65 100644 --- a/contrib/pg_stat_statements/expected/pg_stat_statements.out +++ b/contrib/pg_stat_statements/expected/pg_stat_statements.out @@ -49,17 +49,42 @@ BEGIN \; SELECT 2.0 AS "float" \; SELECT 'world' AS "text" \; COMMIT; + float +--- + 2.0 +(1 row) + + text +--- + world +(1 row) + -- compound with empty statements and spurious leading spacing -\;\; SELECT 3 + 3 \;\;\; SELECT ' ' || ' !' \;\; SELECT 1 + 4 \;; - ?column? --- -5 +\;\; SELECT 3 + 3 AS "+" \;\;\; SELECT ' ' || ' !' AS "||" \;\; SELECT 1 + 4 AS "+" \;; + + +--- + 6 +(1 row) + + || +- + ! +(1 row) + + + +--- + 5 (1 row) -- non ;-terminated statements SELECT 1 + 1 + 1 AS "add" \gset SELECT :add + 1 + 1 AS "add" \; SELECT :add + 1 + 1 AS "add" \gset + add +- + 5 +(1 row) + -- set operator SELECT 1 AS i UNION SELECT 2 ORDER BY i; i @@ -102,12 +127,12 @@ SELECT query, calls, rows FROM pg_stat_statements ORDER BY query COLLATE "C"; SELECT $1+| 4 |4 +| | AS "text" | | - SELECT $1 + $2| 2 |2 SELECT $1 + $2 + $3 AS "add" | 3 |3 + SELECT $1 + $2 AS "+" | 2 |2 SELECT $1 AS "float" | 1 |1 SELECT $1 AS "int"| 2 |2 SELECT $1 AS i UNION SELECT $2 ORDER BY i | 1 |2 - SELECT $1 || $2 | 1 |1 + SELECT $1 || $2 AS "||" | 1 |1 SELECT pg_stat_statements_reset() | 1 |1 WITH t(f) AS ( +| 1 |2 VALUES ($1), ($2) +| | diff --git a/contrib/pg_stat_statements/sql/pg_stat_statements.sql b/contrib/pg_stat_statements/sql/pg_stat_statements.sql index 8b527070d4..ea3a31176e 100644 --- a/contrib/pg_stat_statements/sql/pg_stat_statements.sql +++ b/contrib/pg_stat_statements/sql/pg_stat_statements.sql @@ -27,7 +27,7 @@ SELECT 'world' AS "text" \; COMMIT; -- compound with empty statements and spurious leading spacing -\;\; SELECT 3 + 3 \;\;\; SELECT ' ' || ' !' \;\; SELECT 1 + 4 \;; +\;\; SELECT 3 + 3 AS "+" \;\;\; SELECT ' ' || ' !' AS "||" \;\; SELECT 1 + 4 AS "+" \;; -- non ;-terminated statements SELECT 1 + 1 + 1 AS "add" \gset diff --git a/doc/src/sgml/ref/psql-ref.sgml b/doc/src/sgml/ref/psql-ref.sgml index 7789fc6177..4e6ab5a0a5 100644 --- a/doc/src/sgml/ref/psql-ref.sgml +++ b/doc/src/sgml/ref/psql-ref.sgml @@ -127,18 +127,11 @@ echo '\x \\ SELECT * FROM foo;' | psql commands included in the string to divide it into multiple transactions. (See for more details about how the server handles multi-query strings.) - Also, psql only prints the - result of the last SQL command in the string. - This is different from the behavior when the same string is read from - a file or fed to psql's standard input, - because then psql sends - each SQL command separately. - Because of this behavior, putting more than one SQL command in a - single -c string often has unexpected results. - It's better to use repeated -c commands or feed - multiple commands to psql's standard input, + If having several commands executed in one transaction is not desired, + use repeated -c commands or feed multiple commands to + psql's standard input, either using echo as illustrated above, or via a shell here-document, for example: @@ -3355,10 +3348,8 @@ select 1\; select 2\; select 3; commands included in the string to divide it into multiple transactions. (See for more details about how the server handles multi-query strings.) 
-psql prints only the last query result -it receives for each request; in this example, although all -three SELECTs are indeed executed, psql -only prints the 3. +psql prints all results it receives, one +after the other. diff --git a/src/bin/psql/common.c b/src/bin/psql/common.c index 4b2679360f..db6d031133 100644 --- a/src/bin/psql/common.c +++ b/src/bin/psql/common.c @@ -486,6 +486,19 @@ ResetCancelConn(void) #endif } +/* + * Show error message from result, if any, and check connection in passing. + */ +static void +ShowErrorMessage(const PGresult *resul
Re: backup manifests
On Thu, Sep 19, 2019 at 11:06 PM David Steele wrote: > > I don't think it's a good idea to duplicate the information that's > > already in the backup_label. Storing two copies of the same > > information is just an invitation to having to worry about what > > happens if they don't agree. > > OK, but now we have backup_label, tablespace_map, > ..backup (in the WAL) and now perhaps a > backup.manifest file. I feel like we may be drowning in backup info files. I agree! I'm not sure what to do about it, though. The information that is present in the tablespace_map file could have been stored in the backup_label file, I think, and that would have made sense, because both files are serving a very similar purpose: they tell the server that it needs to do some non-standard stuff when it starts up, and they give it instructions for what those things are. And, as a secondary purpose, humans or third-party tools can read them and use that information for whatever purpose they wish. The proposed backup_manifest file is a little different. I don't think that anyone is proposing that the server should read that file: it is there solely for the purpose of helping our own tools or third-party tools or human beings who are, uh, acting like tools.[1] We're also proposing to put it in a different place: the backup_label goes into one of the tar files, but the backup_manifest would sit outside of any tar file. If we were designing this from scratch, maybe we'd roll all of this into one file that serves as backup manifest, tablespace map, backup label, and backup history file, but then again, maybe separating the instructions-to-the-server part from the backup-integrity-checking part makes sense. At any rate, even if we knew for sure that's the direction we wanted to go, getting there from here looks a bit rough. If we just add a backup manifest, people who don't care can mostly ignore it and then should be mostly fine. If we start trying to create the one backup information system to rule them all, we're going to break people's tools. Maybe that's worth doing someday, but the paint isn't even dry on removing recovery.conf yet. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company [1] There are a surprising number of installations where, in effect, the DBA is the backup-and-restore tool, performing all the steps by hand and hoping not to mess any of them up. The fact that nearly every PostgreSQL company offers tools to make this easier does not seem to have done a whole lot to diminish the number of people using ad-hoc solutions.
Re: Usage of the system truststore for SSL certificate validation
On Thu, 19 Sep 2019 at 12:26, Isaac Morland wrote: > If we're going to open this up, can we add an option to say "this key is > allowed to log in to this account", SSH style? > > I like the idea of using keys rather than .pgpass, but I like the > ~/.ssh/authorized_keys model and don't like the "set up an entire > certificate infrastructure" approach. > Sorry for the top-post.
Re: backup manifests
On Fri, Sep 20, 2019 at 9:46 AM Robert Haas wrote: > - appendStringInfo et. al. I don't think it would be that hard to move > this to src/common, but I'm also not sure it really solves the > problem, because StringInfo has a 1GB limit, and there's no rule at > all that a backup manifest has got to be less than 1GB. Hmm. That's actually going to be a problem on the server side, no matter what we do on the client side. We have to send the manifest after we send everything else, so that we know what we sent. But if we sent a lot of files, the manifest might be really huge. I had been thinking that we would generate the manifest on the server and send it to the client after everything else, but maybe this is an argument for generating the manifest on the client side and writing it incrementally. That would require the client to peek at the contents of every tar file it receives all the time, which it currently doesn't need to do, but it does peek inside them a little bit, so maybe it's OK. Another alternative would be to have the server spill the manifest in progress to a temp file and then stream it from there to the client. Thoughts? -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
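A minimal sketch of the spill-then-stream idea in plain stdio, independent of any backend infrastructure; send_chunk() is a placeholder for however the server would actually ship the bytes to the client (for example a CopyData message), not a real API:

#include <stdio.h>
#include <stdlib.h>

static void
send_chunk(const char *buf, size_t len)
{
	/* stand-in: a walsender would wrap this in a protocol message */
	fwrite(buf, 1, len, stdout);
}

int
main(void)
{
	FILE   *spill = tmpfile();
	char	buf[8192];
	size_t	nread;

	if (spill == NULL)
	{
		perror("tmpfile");
		return 1;
	}

	/* append one line per file as the base backup is sent */
	fprintf(spill, "base/1/1259\t8192\t0e4fa1c3\n");
	fprintf(spill, "base/1/1249\t16384\t5ab01fd2\n");

	/* once everything else has been sent, stream the manifest */
	rewind(spill);
	while ((nread = fread(buf, 1, sizeof(buf), spill)) > 0)
		send_chunk(buf, nread);

	fclose(spill);
	return 0;
}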
Re: backup manifests
On 9/20/19 9:46 AM, Robert Haas wrote: > On Thu, Sep 19, 2019 at 11:06 PM David Steele wrote: > >> My experience is that JSON is simple to implement and has already dealt >> with escaping and data structure considerations. A home-grown solution >> will be at least as complex but have the disadvantage of being non-standard. > > I think that's fair and just spent a little while investigating how > difficult it would be to disentangle the JSON parser from the backend. > It has dependencies on the following bits of backend-only > functionality: > - elog() / ereport(). Kind of a pain. We could just kill the program > if an error occurs, but that seems a bit ham-fisted. Refactoring the > code so that the error is returned rather than thrown might be the way > to go, but it's not simple, because you're not just passing a string. Seems to me we are overdue for elog()/ereport() compatible error-handling in the front end. Plus mem contexts. It sucks to make that a prereq for this project but the longer we kick that can down the road... > https://www.pgcon.org/2013/schedule/events/595.en.html This talk was good fun. The largest number of tables we've seen is a few hundred thousand, but that still adds up to more than a million files to backup. > This gets at another problem that I just started to think about. If > the file is just a series of lines, you can parse it one line and a > time and do something with that line, then move on. If it's a JSON > blob, you have to parse the whole file and get a potentially giant > data structure back, and then operate on that data structure. At > least, I think you do. JSON can definitely be parsed incrementally, but for practical reasons certain structures work better than others. > There's probably some way to create a callback > structure that lets you presuppose that the toplevel data structure is > an array (or object) and get back each element of that array (or > key/value pair) as it's parsed, but that sounds pretty annoying to get > working. And that's how we do it. It's annoying and yeah it's complicated but it is very fast and memory-efficient. > Or we could just decide that you have to have enough memory > to hold the parsed version of the entire manifest file in memory all > at once, and if you don't, maybe you should drop some tables or buy > more RAM. I assume you meant "un-parsed" here? > That still leaves you with bypassing the 1GB size limit on > StringInfo, maybe by having a "huge" option, or perhaps by > memory-mapping the file and then making the StringInfo point directly > into the mapped region. Perhaps I'm overthinking this and maybe you > have a simpler idea in mind about how it can be made to work, but I > find all this complexity pretty unappealing. Our String object has the same 1GB limit. Partly because it works and saves a bit of memory per object, but also because if we find ourselves exceeding that limit we know we've probably made a design error. Parsing in stream means that you only need to store the final in-memory representation of the manifest which can be much more compact. Yeah, it's complicated, but the memory and time savings are worth it. Note that our Perl implementation took the naive approach and has worked pretty well for six years, but can choke on really large manifests with out of memory errors. Overall, I'd say getting the format right is more important than having the perfect initial implementation. > Here's a competing proposal: let's decide that lines consist of > tab-separated fields. 
If a field contains a \t, \r, or \n, put a " at > the beginning, a " at the end, and double any " that appears in the > middle. This is easy to generate and easy to parse. It lets us > completely ignore encoding considerations. Incremental parsing is > straightforward. Quoting will rarely be needed because there's very > little reason to create a file inside a PostgreSQL data directory that > contains a tab or a newline, but if you do it'll still work. The lack > of quoting is nice for humans reading the manifest, and nice in terms > of keeping the manifest succinct; in contrast, note that using JSON > doubles every backslash. There's other information you'll want to store that is not strictly file info so you need a way to denote that. It gets complicated quickly. > I hear you saying that this is going to end up being just as complex > in the end, but I don't think I believe it. It sounds to me like the > difference between spending a couple of hours figuring this out and > spending a couple of months trying to figure it out and maybe not > actually getting anywhere. Maybe the initial implementation will be easier but I am confident we'll pay for it down the road. Also, don't we want users to be able to read this file? Do we really want them to need to cook up a custom parser in Perl, Go, Python, etc.? -- -David da...@pgmasters.net
Re: Global temporary tables
I have added support of all indexes (brin, btree, gin, gist, hash, spgist) for global temp tables (before only B-Tree index was supported). It will be nice to have some generic mechanism for it, but I do not understand how it can look like. The problem is that normal relations are initialized at the moment of their creation. But for global temp relations metadata already exists while data is absent. We should somehow catch such access to not initialized page (but not not all pages, but just first page of relation) and perform initialization on demand. New patch for global temp tables with shared buffers is attached. -- Konstantin Knizhnik Postgres Professional: http://www.postgrespro.com The Russian Postgres Company diff --git a/contrib/pg_buffercache/pg_buffercache_pages.c b/contrib/pg_buffercache/pg_buffercache_pages.c index 1bd579f..2d93f6f 100644 --- a/contrib/pg_buffercache/pg_buffercache_pages.c +++ b/contrib/pg_buffercache/pg_buffercache_pages.c @@ -153,9 +153,9 @@ pg_buffercache_pages(PG_FUNCTION_ARGS) buf_state = LockBufHdr(bufHdr); fctx->record[i].bufferid = BufferDescriptorGetBuffer(bufHdr); - fctx->record[i].relfilenode = bufHdr->tag.rnode.relNode; - fctx->record[i].reltablespace = bufHdr->tag.rnode.spcNode; - fctx->record[i].reldatabase = bufHdr->tag.rnode.dbNode; + fctx->record[i].relfilenode = bufHdr->tag.rnode.node.relNode; + fctx->record[i].reltablespace = bufHdr->tag.rnode.node.spcNode; + fctx->record[i].reldatabase = bufHdr->tag.rnode.node.dbNode; fctx->record[i].forknum = bufHdr->tag.forkNum; fctx->record[i].blocknum = bufHdr->tag.blockNum; fctx->record[i].usagecount = BUF_STATE_GET_USAGECOUNT(buf_state); diff --git a/contrib/pg_prewarm/autoprewarm.c b/contrib/pg_prewarm/autoprewarm.c index 38ae240..8a04954 100644 --- a/contrib/pg_prewarm/autoprewarm.c +++ b/contrib/pg_prewarm/autoprewarm.c @@ -608,9 +608,9 @@ apw_dump_now(bool is_bgworker, bool dump_unlogged) if (buf_state & BM_TAG_VALID && ((buf_state & BM_PERMANENT) || dump_unlogged)) { - block_info_array[num_blocks].database = bufHdr->tag.rnode.dbNode; - block_info_array[num_blocks].tablespace = bufHdr->tag.rnode.spcNode; - block_info_array[num_blocks].filenode = bufHdr->tag.rnode.relNode; + block_info_array[num_blocks].database = bufHdr->tag.rnode.node.dbNode; + block_info_array[num_blocks].tablespace = bufHdr->tag.rnode.node.spcNode; + block_info_array[num_blocks].filenode = bufHdr->tag.rnode.node.relNode; block_info_array[num_blocks].forknum = bufHdr->tag.forkNum; block_info_array[num_blocks].blocknum = bufHdr->tag.blockNum; ++num_blocks; diff --git a/contrib/pgrowlocks/pgrowlocks.c b/contrib/pgrowlocks/pgrowlocks.c index a2c44a9..43b4c66 100644 --- a/contrib/pgrowlocks/pgrowlocks.c +++ b/contrib/pgrowlocks/pgrowlocks.c @@ -158,7 +158,8 @@ pgrowlocks(PG_FUNCTION_ARGS) /* must hold a buffer lock to call HeapTupleSatisfiesUpdate */ LockBuffer(hscan->rs_cbuf, BUFFER_LOCK_SHARE); - htsu = HeapTupleSatisfiesUpdate(tuple, + htsu = HeapTupleSatisfiesUpdate(mydata->rel, + tuple, GetCurrentCommandId(false), hscan->rs_cbuf); xmax = HeapTupleHeaderGetRawXmax(tuple->t_data); diff --git a/contrib/pgstattuple/pgstattuple.c b/contrib/pgstattuple/pgstattuple.c index 70af43e..9cce720 100644 --- a/contrib/pgstattuple/pgstattuple.c +++ b/contrib/pgstattuple/pgstattuple.c @@ -349,7 +349,7 @@ pgstat_heap(Relation rel, FunctionCallInfo fcinfo) /* must hold a buffer lock to call HeapTupleSatisfiesVisibility */ LockBuffer(hscan->rs_cbuf, BUFFER_LOCK_SHARE); - if (HeapTupleSatisfiesVisibility(tuple, &SnapshotDirty, hscan->rs_cbuf)) + if 
(HeapTupleSatisfiesVisibility(rel, tuple, &SnapshotDirty, hscan->rs_cbuf)) { stat.tuple_len += tuple->t_len; stat.tuple_count++; diff --git a/src/backend/access/brin/brin_revmap.c b/src/backend/access/brin/brin_revmap.c index e2bfbf8..97041a8 100644 --- a/src/backend/access/brin/brin_revmap.c +++ b/src/backend/access/brin/brin_revmap.c @@ -25,6 +25,7 @@ #include "access/brin_revmap.h" #include "access/brin_tuple.h" #include "access/brin_xlog.h" +#include "access/brin.h" #include "access/rmgr.h" #include "access/xloginsert.h" #include "miscadmin.h" @@ -79,6 +80,11 @@ brinRevmapInitialize(Relation idxrel, BlockNumber *pagesPerRange, meta = ReadBuffer(idxrel, BRIN_METAPAGE_BLKNO); LockBuffer(meta, BUFFER_LOCK_SHARE); page = BufferGetPage(meta); + + if (GlobalTempRelationPageIsNotInitialized(idxrel, page)) + brin_metapage_init(page, BrinGetPagesPerRange(idxrel), + BRIN_CURRENT_VERSION); + TestForOldSnapshot(snapshot, idxrel, page); metadata = (BrinMetaPageData *) PageGetContents(page); diff --git a/src/backend/access/gin/ginfast.c b/src/backend/access/gin/ginfast.c index 439a91b..8a6ac71 100644 --- a/src/backend/access/gin/ginfast.c +++ b/src/backend/access/gin/ginfast.c @@ -241,6 +241,16 @@ ginHeapTupleFastInsert(GinState *ginstate, GinTupleCollector *collect
Re: Option to dump foreign data in pg_dump
Hello, thanks for the comments! *I suggest the option to be just --foreign-data because if we make it --include-foreign-data it's expected to have an --exclude-foreign-data option too. Several pg_dump options have no counterpart, e.g. --enable-row-security does not have a disable (which is the default). Also, calling it --foreign-data would sound similar to --table: by default all tables are dumped, but with --table only the selected tables are dumped. Whereas without --include-foreign-data all foreign data is excluded, and only with the option is some foreign data included. *please add test case I added test cases for the invalid inputs. I'll try to make a test case for the actual dump of foreign data, but that requires more setup, because a functional fdw is needed there. * + if (tdinfo->filtercond || tbinfo->relkind == RELKIND_FOREIGN_TABLE) filter condition is not implemented completely yet so the logic only works on foreign tables, so I think it's better to handle it separately Note that there is another if condition that actually applies the filter condition if provided; also, for a foreign table we need to do a COPY SELECT instead of a COPY TO * I don’t understand the need for changing the SELECT query. We can use the same SELECT query syntax for both regular tables and foreign tables To which query do you refer? In the patch there are three queries: one retrieves the foreign servers, another is the SELECT in the COPY that is now applied in the case of a filter condition on a foreign table, and a third retrieves the oid of a given foreign server. > As you have specified required_argument in above: > + {"include-foreign-data", required_argument, NULL, 11}, > > The below check may not be required: > + if (strcmp(optarg, "") == 0) > + { > + pg_log_error("empty string is not a valid pattern in > --include-foreign-data"); > + exit_nicely(1); > + } We need to keep this check to prevent the use of '--include-foreign-data=', which would match all foreign servers. And in previous messages it was established that that behavior is too coarse.
> > + if (foreign_servers_include_patterns.head != NULL) > + { > + expand_foreign_server_name_patterns(fout, &foreign_servers_include_patterns, > + &foreign_servers_include_oids); > + if (foreign_servers_include_oids.head == NULL) > + fatal("no matching foreign servers were found"); > + } > + > > The above check if (foreign_servers_include_oids.head == NULL) may not > be required, as there is a check present inside > expand_foreign_server_name_patterns to handle this error: > + > + res = ExecuteSqlQuery(fout, query->data, PGRES_TUPLES_OK); > + if (PQntuples(res) == 0) > + fatal("no matching foreign servers were found for pattern \"%s\"", > cell->val); > + Removed > > +static void > +expand_foreign_server_name_patterns(Archive *fout, > + SimpleStringList *patterns, > + SimpleOidList *oids) > +{ > + PQExpBuffer query; > + PGresult *res; > + SimpleStringListCell *cell; > + int i; > + > + if (patterns->head == NULL) > + return; /* nothing to do */ > + > > The above check for patterns->head may not be required as similar > check exists before this function is called: > + if (foreign_servers_include_patterns.head != NULL) > + { > + expand_foreign_server_name_patterns(fout, &foreign_servers_include_patterns, > + &foreign_servers_include_oids); > + if (foreign_servers_include_oids.head == NULL) > + fatal("no matching foreign servers were found"); > + } > + I think that it is better that the function expand_foreign_server_name do not rely on a non-NULL head, so it checks it by itself, and is closer to the other expand_* functions. Instead I've removed the check before the function is called. > > + /* Skip FOREIGN TABLEs (no data to dump) if not requested explicitly */ > + if (tbinfo->relkind == RELKIND_FOREIGN_TABLE && > + (foreign_servers_include_oids.head == NULL || > + !simple_oid_list_member(&foreign_servers_include_oids, > tbinfo->foreign_server_oid))) > simple_oid_list_member can be split into two lines Done Cheers Luis M Carril diff --git a/doc/src/sgml/ref/pg_dump.sgml b/doc/src/sgml/ref/pg_dump.sgml index 4bcd4bdaef..319851b936 100644 --- a/doc/src/sgml/ref/pg_dump.sgml +++ b/doc/src/sgml/ref/pg_dump.sgml @@ -767,6 +767,34 @@ PostgreSQL documentation + + --include-foreign-data=foreignserver + + +Dump the data for any foreign table with a foreign server +matching foreignserver +pattern. Multiple foreign servers can be selected by writing multiple --include-foreign-data. +Also, the foreignserver parameter is +interpreted as a pattern according to the same rules used by +psql's \d commands (see ), +so multiple foreign servers can also be selected by writing wildcard characters +in the pattern. When using wildcards, be careful to quote the pattern +if needed to prevent the shell from expanding the wildcards; see +. +
Re: backup manifests
On 9/20/19 10:59 AM, Robert Haas wrote: > On Fri, Sep 20, 2019 at 9:46 AM Robert Haas wrote: >> - appendStringInfo et. al. I don't think it would be that hard to move >> this to src/common, but I'm also not sure it really solves the >> problem, because StringInfo has a 1GB limit, and there's no rule at >> all that a backup manifest has got to be less than 1GB. > > Hmm. That's actually going to be a problem on the server side, no > matter what we do on the client side. We have to send the manifest > after we send everything else, so that we know what we sent. But if we > sent a lot of files, the manifest might be really huge. I had been > thinking that we would generate the manifest on the server and send it > to the client after everything else, but maybe this is an argument for > generating the manifest on the client side and writing it > incrementally. That would require the client to peek at the contents > of every tar file it receives all the time, which it currently doesn't > need to do, but it does peek inside them a little bit, so maybe it's > OK. > > Another alternative would be to have the server spill the manifest in > progress to a temp file and then stream it from there to the client. This seems reasonable to me. We keep an in-memory representation which is just an array of structs and is fairly compact -- 1 million files uses ~150MB of memory. We just format and stream this to storage when saving. Saving is easier than loading, of course. -- -David da...@pgmasters.net
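For a sense of what that looks like, here is a minimal sketch of such a per-file entry (the field names are illustrative assumptions, not pgBackRest's or any proposed patch's actual structures); an array of about a million of these, plus the path and checksum strings they point to, lands in the ~150MB ballpark mentioned above and can be formatted and streamed out in a single pass when saving:

#include <stdint.h>
#include <time.h>

/* Illustrative only: one manifest entry per file in the backup. */
typedef struct ManifestEntry
{
    char       *path;       /* file name relative to the data directory */
    uint64_t    size;       /* file size in bytes as copied */
    time_t      mtime;      /* last modification time */
    char       *checksum;   /* hex-encoded checksum, algorithm-dependent */
} ManifestEntry;

/* The manifest itself: a resizable array, formatted only at save time. */
typedef struct Manifest
{
    ManifestEntry *entries;
    size_t      nentries;
    size_t      maxentries;
} Manifest;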
Re: backup manifests
On 9/20/19 9:46 AM, Robert Haas wrote: > least, I think you do. There's probably some way to create a callback > structure that lets you presuppose that the toplevel data structure is > an array (or object) and get back each element of that array (or > key/value pair) as it's parsed, If a JSON parser does find its way into src/common, it probably wants to have such an incremental mode available, similar to [2] offered in the "Jackson" library for Java. The Jackson developer has propounded a thesis[1] that such a parsing library ought to offer "Three -- and Only Three" different styles of API corresponding to three ways of organizing the code using the library ([2], [3], [4], which also resemble the different APIs supplied in Java for XML processing). Regards, -Chap [1] http://www.cowtowncoder.com/blog/archives/2009/01/entry_132.html [2] http://www.cowtowncoder.com/blog/archives/2009/01/entry_137.html [3] http://www.cowtowncoder.com/blog/archives/2009/01/entry_153.html [4] http://www.cowtowncoder.com/blog/archives/2009/01/entry_152.html
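To make the incremental-mode idea concrete, here is a minimal sketch of what a callback-style (streaming) interface for such a parser could look like; all of the type and function names below are hypothetical, not an existing PostgreSQL or Jackson API:

#include <stdbool.h>
#include <stddef.h>

/* One callback per syntactic event, so no whole-document tree is built. */
typedef enum JsonEvent
{
    JSON_OBJECT_START,
    JSON_OBJECT_END,
    JSON_ARRAY_START,
    JSON_ARRAY_END,
    JSON_KEY,       /* field name; its value is reported by the next event */
    JSON_SCALAR     /* string, number, boolean, or null */
} JsonEvent;

typedef struct JsonStreamParser JsonStreamParser;   /* opaque */

/* "token" carries the key or scalar text for JSON_KEY/JSON_SCALAR events. */
typedef void (*json_event_cb) (JsonEvent event, const char *token, void *arg);

extern JsonStreamParser *json_stream_create(json_event_cb cb, void *arg);
extern bool json_stream_feed(JsonStreamParser *p, const char *buf, size_t len);
extern bool json_stream_finish(JsonStreamParser *p);

A caller that presupposes the top level is an array, as described in the quoted text, would simply ignore events until the first JSON_ARRAY_START and hand each completed element to its own handler.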
Re: WAL recycled despite logical replication slot
On Fri, Sep 20, 2019 at 08:45:34AM -0400, Jeff Janes wrote: While testing something else (whether "terminating walsender process due to replication timeout" was happening spuriously), I had logical replication set up streaming a default pgbench transaction load, with the publisher being 13devel-e1c8743 and subscriber being 12BETA4. Eventually I started getting errors about requested wal segments being already removed: 10863 sub idle 0 2019-09-19 17:14:58.140 EDT LOG: starting logical decoding for slot "sub" 10863 sub idle 0 2019-09-19 17:14:58.140 EDT DETAIL: Streaming transactions committing after 79/EB0B17A0, reading WAL from 79/E70736A0. 10863 sub idle 58P01 2019-09-19 17:14:58.140 EDT ERROR: requested WAL segment 0001007900E7 has already been removed 10863 sub idle 0 2019-09-19 17:14:58.144 EDT LOG: disconnection: session time: 0:00:00.030 user=jjanes database=jjanes host=10.0.2.2 port=40830 It had been streaming for about 50 minutes before the error showed up, and it showed right when streaming was restarting after one of the replication timeouts. Is there an innocent explanation for this? I thought logical replication slots provided an iron-clad guarantee that WAL would be retained until it was no longer needed. I am just using pub/sub, none of the lower level stuff. I think you're right - this should not happen with replication slots. Can you provide more detailed setup instructions, so that I can try to reproduce and investigate the isssue? regards -- Tomas Vondra http://www.2ndQuadrant.com PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Re: subscriptionCheck failures on nightjar
On Thu, Sep 19, 2019 at 01:23:05PM +0900, Michael Paquier wrote: On Wed, Sep 18, 2019 at 11:58:08PM +0200, Tomas Vondra wrote: I kinda suspect it might be just a coincidence that it fails during that particular test. What likely plays a role here is checkpoint timing (AFAICS that's the thing removing the file). On most systems the tests complete before any checkpoint is triggered, hence no issue. Maybe aggressively triggering checkpoints on the running cluster from another session would do the trick ... Now that I recall, another thing I forgot to mention on this thread is that I patched guc.c to reduce the minimum of checkpoint_timeout to 1s. But even with that change you haven't managed to reproduce the issue, right? Or am I misunderstanding? regards -- Tomas Vondra http://www.2ndQuadrant.com PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Re: Allow CLUSTER, VACUUM FULL and REINDEX to change tablespace on the fly
On 2019-Sep-19, Robert Haas wrote: > So, earlier in this thread, I suggested making this part of ALTER > TABLE, and several people seemed to like that idea. Did we have a > reason for dropping that approach? Hmm, my own reading of that was to add tablespace changing abilities to ALTER TABLE *in addition* to this patch, not instead of it. -- Álvaro Herrera https://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Re: Global temporary tables
On Wed, Sep 18, 2019 at 12:04, Konstantin Knizhnik < k.knizh...@postgrespro.ru> wrote: > > > On 21.08.2019 11:54, Konstantin Knizhnik wrote: > > > > On 20.08.2019 20:01, Pavel Stehule wrote: > > Another solution is wait on ZHeap storage and replica can to have own UNDO > log. > >> >> I thought about implementation of special table access method for >> temporary tables. >> > > +1 > > > Unfortunately implementing special table access method for temporary > tables doesn't solve all problems. > XID generation is not part of table access methods. > So we still need to assign some XID to write transaction at replica which > will not conflict with XIDs received from master. > Actually only global temp tables can be updated at replica and so assigned > XIDs can be stored only in tuples of such relations. > But still I am not sure that we can use arbitrary XID for such > transactions at replica. > > Also I upset by amount of functionality which has to be reimplemented for > global temp tables if we really want to provide access method for them: > > 1. CLOG > 2. vacuum > 3. MVCC visibility > > And still it is not possible to encapsulate all changes need to support > writes to temp tables at replica inside table access method. > XID assignment, transaction commit and abort, subtransactions - all this > places need to be patched. > > > I was able to fully support work with global temp tables at replica > (including subtransactions). > The patch is attached. Also you can find this version in > https://github.com/postgrespro/postgresql.builtin_pool/tree/global_temp_hot > > Right now transactions at replica updating global temp table are assigned > special kind of GIDs which are not related with XIDs received from master. > So special visibility rules are used for such tables at replica. Also I > have to patch TransactionIdIsInProgress, TransactionIdDidCommit, > TransactionIdGetCurrent > functions to correctly handle such XIDs. In principle it is possible to > implement global temp tables as special heap access method. But it will > require copying a lot of code (heapam.c) > so I prefer to add few checks to existed functions. > > There are still some limitations: > - Number of transactions at replica which update temp tables is limited by > 2^32 (wraparound problem is not addressed). > - I have to maintain in-memory analog of CLOG for such transactions which > is also not cropped. It means that for 2^32 transaction size of bitmap can > grow up to 0.5Gb. > > I try to understand what are the following steps in global temp tables > support. > This is why I want to perform short survey - what people are expecting > from global temp tables: > > 1. I do not need them at all. > 2. Eliminate catalog bloating. > 3. Mostly needed for compatibility with Oracle (simplify porting,...). > 4. Parallel query execution. > 5. Can be used at replica. > 6. More efficient use of resources (first of all memory). >

There can be another point that is important for cloud use. Inside some clouds there are usually two types of disks - persistent (slow) and ephemeral (fast). We used temp tables effectively there because we moved the temp tablespace to the ephemeral disks. I am missing one point in your list - developer comfort: using temp tables is just much more comfortable, since you don't need to create them again and again. Due to this behavior it is possible to reduce @2, and @3 can be a nice side effect. If you reduce @2 to zero, then @5 should be possible without anything else. 
Pavel > -- > Konstantin Knizhnik > Postgres Professional: http://www.postgrespro.com > The Russian Postgres Company > >
Re: [HACKERS] CLUSTER command progress monitor
On 2019-Sep-18, Michael Paquier wrote: > On Tue, Sep 17, 2019 at 10:50:22PM -0300, Alvaro Herrera wrote: > > On 2019-Sep-18, Michael Paquier wrote: > >> So, with the clock ticking and the release getting close by, what do > >> we do for this set of issues? REINDEX, CREATE INDEX and CLUSTER all > >> try to build indexes and the current infrastructure is not really > >> adapted to hold all that. Robert, Alvaro and Peter E, do you have any > >> comments to offer? > > > > Which part of it is not already fixed? > > I can still see at least two problems. There is one issue with > pgstat_progress_update_param() which gets called in reindex_table() > for a progress phase of CLUSTER, and this even if > REINDEXOPT_REPORT_PROGRESS is not set in the options. (You mean reindex_relation.) ... but that param update is there for CLUSTER, not for REINDEX, so if we made it dependent on the option to turn on CREATE INDEX progress updates, it would break CLUSTER progress reporting. Also, the parameter being updated is not used by CREATE INDEX, so there's no spurious change. I think this ain't broke, and thus it don't need fixin'. If anything, I would like the CLUSTER report to show index creation progress, which would go the opposite way. But that seems a patch for pg13. > Also it seems > to me that the calls to pgstat_progress_start_command() and > pgstat_progress_end_command() are at incorrect locations for > reindex_index() and that those should be one level higher on the stack > to avoid any kind of interactions with another command whose progress > has already started. That doesn't work, because the caller doesn't have the OID of the table, which pgstat_progress_start_command() needs. -- Álvaro Herrerahttps://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Re: A problem presentaion about ECPG, DECLARE STATEMENT
"kuroda.hay...@fujitsu.com" writes: >> If a solid case can be made that ECPG's DECLARE STATEMENT was done >> wrong, we'd be better off to just revert the feature out of v12 >> and try again, under less time pressure, for v13. > I see, I'll propose this at the next commitfest. > But I'm now considering this commit should be reverted in order to avoid > the confusion. Per this discussion, I've reverted DECLARE STATEMENT out of v12 and HEAD branches. One thing that could use more eyeballs on it is the changes in ecpg_register_prepared_stmt(); that was added after DECLARE STATEMENT so there was no prior state to revert to, and I had to guess a bit. What I guessed, seeing that the lone caller of that function is already using stmt->connection, was that it was completely bogus for ecpg_register_prepared_stmt() to be doing its own new connection lookup and it should just use stmt->connection. But I might be wrong since I'm not too clear about where connection lookups are supposed to be done in this library. Another comment is that this was one of the more painful reverts I've ever had to do. Some of the pain was unavoidable because there were later commits (mostly the PREPARE AS one) changing adjacent code ... but a lot of it was due to flat-out sloppiness in the DECLARE STATEMENT patch, particularly with respect to whitespace. Please run the next submission through pgindent beforehand. Also please pay attention to the documentation cleanups that other people made after the initial patch. We don't want to have to repeat that cleanup work a second time. regards, tom lane
Re: subscriptionCheck failures on nightjar
Hi, On 2019-09-19 17:20:15 +0530, Kuntal Ghosh wrote: > It seems there is a pattern how the error is occurring in different > systems. Following are the relevant log snippets: > > nightjar: > sub3 LOG: received replication command: CREATE_REPLICATION_SLOT > "sub3_16414_sync_16394" TEMPORARY LOGICAL pgoutput USE_SNAPSHOT > sub3 LOG: logical decoding found consistent point at 0/160B578 > sub1 PANIC: could not open file > "pg_logical/snapshots/0-160B578.snap": No such file or directory > > dromedary scenario 1: > sub3_16414_sync_16399 LOG: received replication command: > CREATE_REPLICATION_SLOT "sub3_16414_sync_16399" TEMPORARY LOGICAL > pgoutput USE_SNAPSHOT > sub3_16414_sync_16399 LOG: logical decoding found consistent point at > 0/15EA694 > sub2 PANIC: could not open file > "pg_logical/snapshots/0-15EA694.snap": No such file or directory > > > dromedary scenario 2: > sub3_16414_sync_16399 LOG: received replication command: > CREATE_REPLICATION_SLOT "sub3_16414_sync_16399" TEMPORARY LOGICAL > pgoutput USE_SNAPSHOT > sub3_16414_sync_16399 LOG: logical decoding found consistent point at > 0/15EA694 > sub1 PANIC: could not open file > "pg_logical/snapshots/0-15EA694.snap": No such file or directory > > While subscription 3 is created, it eventually reaches to a consistent > snapshot point and prints the WAL location corresponding to it. It > seems sub1/sub2 immediately fails to serialize the snapshot to the > .snap file having the same WAL location. Since now a number of people (I tried as well), failed to reproduce this locally, I propose that we increase the log-level during this test on master. And perhaps expand the set of debugging information. With the hope that the additional information on the cases encountered on the bf helps us build a reproducer or, even better, diagnose the issue directly. If people agree, I'll come up with a patch. Greetings, Andres Freund
Re: log bind parameter values on error
Hi, On 2019-09-18 17:58:53 -0300, Alvaro Herrera wrote: > --- a/doc/src/sgml/config.sgml > +++ b/doc/src/sgml/config.sgml > @@ -6414,6 +6414,23 @@ log_line_prefix = '%m [%p] %q%u@%d/%a ' > > > > + xreflabel="log_parameters_on_error"> > + log_parameters_on_error (boolean) > + > + log_parameters_on_error configuration > parameter > + > + > + > + > +Controls whether the statement is logged with bind parameter values. Trailing whitespace. I find the "the statement" formulation a bit odd, but can't quite put my finger on why. I think it might be worthwhile to cross-reference log_min_error_statement, as log_parameters_on_error will only take effect when the statement is logged due to log_min_error_statement. > +It adds some overhead, as postgres will cache textual > +representations of parameter values in memory for all statements, > +even if they eventually do not get logged. The default is > +off. Only superusers can change this setting. I don't think "cache" is the right descriptor. Usually caching implies trying to make repeated tasks faster, which isn't the case here. > +/* > + * The top-level portal that the client is immediately working with: > + * creating, binding, executing, or all at one using simple protocol > + */ > +Portal current_top_portal = NULL; > + This strikes me as decidedly not nice. For one this means that we'll not be able to use this infrastructure for binds that are done serverside, e.g. as part of plpgsql. I'm basically inclined to think that integrating this at the postges.c level is the wrong thing. It also adds new error handling complexity, which is already quite substantial. And spreads knowledge of portals to elog.c, which imo shouldn't have to know about them. It seems to me that this would really need to be tracked inside the portal infrastructure. To avoid unnecessary overhead, we could continue to set the text values in exec_bin_message() in the pformat == 0 case, using hasTextValues somewhere in the portal code to determine whether the text representation has to be computed (for other input format, and internal queries as e.g. generated by plpgsql). And then the PG_CATCHes in pquery.c can add the errdetail() in the error cases using an econtext callback. > +/* > + * get_portal_bind_params > + * Get a string containing parameters data -- used for logging. > + * > + * Can return NULL if there are no parameters in the portal or the portal is > + * not valid, or the text representations of the parameters are not > available. > + * If returning a non-NULL value, it allocates memory for the returned string > + * in the current context, and it's the caller's responsibility to pfree() > it. > + */ > +char * > +get_portal_bind_params(ParamListInfo params) > +{ > + StringInfoData param_str; > + > + /* No parameters to format */ > + if (!params || params->numParams == 0) > + return NULL; > + > + /* > + * We either need textual representation of parameters pre-calculated, > or > + * call potentially user-defined I/O functions to convert the internal > + * representation into text, which cannot be done in an aborted xact > + */ > + if (!params->hasTextValues && IsAbortedTransactionBlockState()) > + return NULL; Maybe I'm naive, but what is the point of keeping the separate parameters, allocated separately, when all we're doing is building a string containing them all at error time? Seems better just directly form the full string when decideding to keep the text parameters - for one it'll often end up being more efficient. 
But more importantly it also makes it a lot less likely to run out of memory while handling the error. The individual text parameters can be large, and this will always additionally need at least the combined size of all parameters from within the error context. That's not great. > + appendStringInfoCharMacro(¶m_str, '\''); > + for (p = pstring; *p; p++) > + { > + if (*p == '\'') /* double single quotes */ > + appendStringInfoCharMacro(¶m_str, *p); > + appendStringInfoCharMacro(¶m_str, *p); > + } > + appendStringInfoCharMacro(¶m_str, '\''); I know this is old code, but: This is really inefficient. Will cause a lot of unnecessary memory-reallocations for large text outputs, because we don't immediately grow to at least close to the required size. Greetings, Andres Freund
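As a rough sketch of the portal-level approach suggested here (the Portal field portalParamsStr and both function names are assumptions for illustration, not existing code):

#include "postgres.h"
#include "utils/portal.h"

/*
 * Sketch: the parameter string would be built once, in PortalStart(), from
 * ParamExternData.textValue where available (calling the output functions
 * otherwise), and attached to errors via an error context callback that is
 * pushed around the PG_TRY blocks in pquery.c.
 */
static void
portal_error_callback(void *arg)
{
    Portal      portal = (Portal) arg;

    if (portal->portalParamsStr != NULL)    /* assumed new Portal field */
        errdetail("parameters: %s", portal->portalParamsStr);
}

static void
run_portal_guarded(Portal portal)
{
    ErrorContextCallback errcallback;

    errcallback.callback = portal_error_callback;
    errcallback.arg = (void *) portal;
    errcallback.previous = error_context_stack;
    error_context_stack = &errcallback;

    /* ... PortalRun()/PortalRunFetch() would happen here ... */

    error_context_stack = errcallback.previous;
}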
Re: Add "password_protocol" connection parameter to libpq
On Fri, 2019-09-20 at 13:07 +0900, Michael Paquier wrote: > Those are mainly nits, and attached are the changes I would do to > your > patch. Please feel free to consider those changes as you see fit. > Anyway, the base logic looks good to me, so I am switching the patch > as ready for committer. Thank you, applied. * I also changed the comment above pg_fe_scram_channel_bound() to clarify that the caller must also check that the exchange was successful. * I changed the error message when AUTH_REQ_OK is received without channel binding. It seemed confusing before. I also added a test. Regards, Jeff Davis From 1266d7bb6c46aa85b3c48ee271115e5ce6f4bad0 Mon Sep 17 00:00:00 2001 From: Jeff Davis Date: Tue, 20 Aug 2019 18:59:23 -0700 Subject: [PATCH] Add libpq parameter 'channel_binding'. Allow clients to require channel binding to enhance security against untrusted servers. Author: Jeff Davis Discussion: https://postgr.es/m/227015d8417f2b4fef03f8966dbfa5cbcc4f44da.camel%40j-davis.com --- doc/src/sgml/libpq.sgml | 32 ++ src/interfaces/libpq/fe-auth-scram.c | 41 ++-- src/interfaces/libpq/fe-auth.c| 76 +-- src/interfaces/libpq/fe-auth.h| 1 + src/interfaces/libpq/fe-connect.c | 41 +++- src/interfaces/libpq/libpq-int.h | 2 + src/test/authentication/t/001_password.pl | 12 +++- src/test/ssl/t/002_scram.pl | 39 +++- src/test/ssl/t/SSLServer.pm | 9 ++- 9 files changed, 233 insertions(+), 20 deletions(-) diff --git a/doc/src/sgml/libpq.sgml b/doc/src/sgml/libpq.sgml index 1189341ca15..c58527b0c3b 100644 --- a/doc/src/sgml/libpq.sgml +++ b/doc/src/sgml/libpq.sgml @@ -1122,6 +1122,28 @@ postgresql://%2Fvar%2Flib%2Fpostgresql/dbname + + channel_binding + + +This option controls the client's use of channel binding. A setting +of require means that the connection must employ +channel binding, prefer means that the client will +choose channel binding if available, and disable +prevents the use of channel binding. The default +is prefer if +PostgreSQL is compiled with SSL support; +otherwise the default is disable. + + +Channel binding is a method for the server to authenticate itself to +the client. It is only supported over SSL connections +with PostgreSQL 11 or later servers using +the SCRAM authentication method. + + + + connect_timeout @@ -6864,6 +6886,16 @@ myEventProc(PGEventId evtId, void *evtInfo, void *passThrough) + + + + PGCHANNELBINDING + + PGCHANNELBINDING behaves the same as the connection parameter. + + + diff --git a/src/interfaces/libpq/fe-auth-scram.c b/src/interfaces/libpq/fe-auth-scram.c index 7a8335bf9f8..03f42f06030 100644 --- a/src/interfaces/libpq/fe-auth-scram.c +++ b/src/interfaces/libpq/fe-auth-scram.c @@ -119,6 +119,35 @@ pg_fe_scram_init(PGconn *conn, return state; } +/* + * Return true if channel binding was employed and the scram exchange + * completed. This should be used after a successful exchange to determine + * whether the server authenticated itself to the client. + * + * Note that the caller must also ensure that the exchange was actually + * successful. + */ +bool +pg_fe_scram_channel_bound(void *opaq) +{ + fe_scram_state *state = (fe_scram_state *) opaq; + + /* no SCRAM exchange done */ + if (state == NULL) + return false; + + /* SCRAM exchange not completed */ + if (state->state != FE_SCRAM_FINISHED) + return false; + + /* channel binding mechanism not used */ + if (strcmp(state->sasl_mechanism, SCRAM_SHA_256_PLUS_NAME) != 0) + return false; + + /* all clear! 
*/ + return true; +} + /* * Free SCRAM exchange status */ @@ -225,9 +254,7 @@ pg_fe_scram_exchange(void *opaq, char *input, int inputlen, /* * Verify server signature, to make sure we're talking to the - * genuine server. XXX: A fake server could simply not require - * authentication, though. There is currently no option in libpq - * to reject a connection, if SCRAM authentication did not happen. + * genuine server. */ if (verify_server_signature(state)) *success = true; @@ -358,7 +385,8 @@ build_client_first_message(fe_scram_state *state) appendPQExpBufferStr(&buf, "p=tls-server-end-point"); } #ifdef HAVE_PGTLS_GET_PEER_CERTIFICATE_HASH - else if (conn->ssl_in_use) + else if (conn->channel_binding[0] != 'd' && /* disable */ + conn->ssl_in_use) { /* * Client supports channel binding, but thinks the server does not. @@ -369,7 +397,7 @@ build_client_first_message(fe_scram_state *state) else { /* - * Client does not support channel binding. + * Client does not support channel binding, or has disabled it. */ appendPQEx
Re: allow online change primary_conninfo
Hi, On 2019-09-19 17:46:06 +0300, Sergei Kornilov wrote: > > - This parameter can only be set at server start. > + This parameter can only be set in the > postgresql.conf > + file or on the server command line. >This setting has no effect if the server is not in standby mode. > > + > + If primary_conninfo is changed while WAL > receiver is running, > + the WAL receiver shuts down and then restarts with new setting, > + except when primary_conninfo is an empty string. > + >From the sentence structure it's not clear that "except when primary_conninfo is an empty string" only applies to "and then restarts with new setting". > +/* > + * Need for restart running WalReceiver due the configuration change. > + * Suitable only for XLOG_FROM_STREAM source > + */ > +static bool pendingWalRcvRestart = false; s/due the/due to a/ (or even "due to the"). > @@ -11862,6 +11869,17 @@ WaitForWALToBecomeAvailable(XLogRecPtr RecPtr, bool > randAccess, > if (WalRcvStreaming()) > ShutdownWalRcv(); > > + /* > + * If wal receiver is requested to > restart, we skip the > + * next XLOG_FROM_ARCHIVE to > immediately starting it. > + */ > + if (pendingWalRcvRestart) > + { > + lastSourceFailed = true; > + currentSource = > XLOG_FROM_ARCHIVE; > + continue; > + } I can't parse that comment. What does "skipping to starting" mean? I assume it's just about avoiding wal_retrieve_retry_interval, but I think the comment ought to be rephrased. Also, should we really check this before scanning for new timelines? Why is it the right thing to change to XLOG_FROM_ARCHIVE when we're just restarting walreceiver? The server might unnecessarily get stuck in archive based recovery for a long time this way? It seems to me that we'd need to actually trigger RequestXLogStreaming() in this case. Greetings, Andres Freund
Re: backup manifests
On Fri, Sep 20, 2019 at 11:09 AM David Steele wrote: > Seems to me we are overdue for elog()/ereport() compatible > error-handling in the front end. Plus mem contexts. > > It sucks to make that a prereq for this project but the longer we kick > that can down the road... There are no doubt many patches that would benefit from having more backend infrastructure exposed in frontend contexts, and I think we're slowly moving in that direction, but I generally do not believe in burdening feature patches with major infrastructure improvements. Sometimes it's necessary, as in the case of parallel query, which required upgrading a whole lot of backend infrastructure in order to have any chance of doing something useful. In most cases, however, there's a way of getting the patch done that dodges the problem. For example, I think there's a pretty good argument that Heikki's design for relation forks was a bad one. It's proven to scale poorly and create performance problems and extra complexity in quite a few places. It would likely have been better, from a strictly theoretical point of view, to insist on a design where the FSM and VM pages got stored inside the relation itself, and the heap was responsible for figuring out how various pages were being used. When BRIN came along, we insisted on precisely that design, because it was clear that further straining the relation fork system was not a good plan. However, if we'd insisted on that when Heikki did the original work, it might have delayed the arrival of the free space map for one or more releases, and we got big benefits out of having that done sooner. There's nothing stopping someone from writing a patch to get rid of relation forks and allow a heap AM to have multiple relfilenodes (with the extra ones used for the FSM and VM) or with multiplexing all the data inside of a single file. Nobody has, though, because it's hard, and the problems with the status quo are not so bad as to justify the amount of development effort that would be required to fix it. At some point, that problem is probably going to work its way to the top of somebody's priority list, but it's already been about 10 years since that all happened and everyone has so far dodged dealing with the problem, which in turn has enabled them to work on other things that are perhaps more important. I think the same principle applies here. It's reasonable to ask the author of a feature patch to fix issues that are closely related to the feature in question, or even problems that are not new but would be greatly exacerbated by the addition of the feature. It's not reasonable to stack up a list of infrastructure upgrades that somebody has to do as a condition of having a feature patch accepted that does not necessarily require those upgrades. I am not convinced that JSON is actually a better format for a backup manifest (more on that below), but even if I were, I believe that getting a backup manifest functionality into PostgreSQL 13, and perhaps incremental backup on top of that, is valuable enough to justify making some compromises to make that happen. And I don't mean "compromises" as in "let's commit something that doesn't work very well;" rather, I mean making design choices that are aimed at making the project something that is feasible and can be completed in reasonable time, rather than not. And saying, well, the backup manifest format *has* to be JSON because everything else suxxor is not that. We don't have a single other example of a file that we read and write in JSON format. 
Extension control files use a custom format. Backup labels and backup history files and timeline history files and tablespace map files use custom formats. postgresql.conf, pg_hba.conf, and pg_ident.conf use custom formats. postmaster.opts and postmaster.pid use custom formats. If JSON is better and easier, at least one of the various people who coded those things up would have chosen to use it, but none of them did, and nobody's made a serious attempt to convert them to use it. That might be because we lack the infrastructure for dealing with JSON and building it is more work than anybody's willing to do, or it might be because JSON is not actually better for these kinds of use cases, but either way, it's hard to see why this particular patch should be burdened with a requirement that none of the previous ones had to satisfy. Personally, I'd be intensely unhappy if a motion to convert postgresql.conf or pg_hba.conf to JSON format gathered enough steam to be adopted. It would be darn useful, because you could specify complex values for options instead of being limited to scalars, but it would also make the configuration files a lot harder for human beings to read and grep and the quality of error reporting would probably decline significantly. Also, appending a setting to the file, something which is currently quite simple, would get a lot harder. Ad-hoc file formats can be problematic, but they can also have real
Re: pgbench - allow to create partitioned tables
Hello Amit,

I would not bother to create a patch for so small an improvement. This makes sense in passing because the created function makes it very easy, but otherwise I'll just drop it.

I would prefer to drop for now.

Attached v13 does that; I added a comment instead. I do not think that it is an improvement.

+ else + { + fprintf(stderr, "unexpected partition method: \"%s\"\n", ps); + exit(1); + }

If we can't catch that earlier, then it might be better to have some version-specific checks rather than such obscure code which is difficult to understand for others.

Hmmm. The code simply checks for the current partitioning and fails if the result is unknown, which I understood was what you asked; the previous version was just ignoring the result. It seems unlikely that postgres will drop support for range or hash partitions. This issue would rather be raised if an older partition-enabled pgbench is run against a newer postgres which adds a new partition method. But then I cannot guess when a new partition method will be added, so I cannot put a guard with a version about something in the future. Possibly, if no new method is ever added, the code will never be triggered.

I have made a few modifications in the attached patch.

* move the create partitions related code into a separate function.

Why not. Not sure it is an improvement.

* make the check related to number of partitions consistent i.e. check partitions > 0 apart from where we print which I also want to change but let us first discuss one of the above points

I switched two instances of >= 1 to > 0, which had 1 instance before.

* when we don't find the pgbench_accounts table, error out instead of continuing

I do not think that it is a good idea, but I did it anyway to move things forward.

* ensure append_fillfactor doesn't assume that it has to append fillfactor and removed fillfactor < 100 check from it.

Done, which is too bad.

* improve the comments around query to fetch partitions

What? How? There are already quite a few comments compared to the length of the query.

* improve the comments in the patch and make the code look like nearby code

This requirement is too fuzzy. I re-read the changes, and both code and comments look okay to me.

* pgindent the patch

Done.

I think we should try to add some note or comment about why we only choose to partition the pgbench_accounts table when the user has given the --partitions option.

Added as a comment on the initPartition function.

-- Fabien.

diff --git a/doc/src/sgml/ref/pgbench.sgml b/doc/src/sgml/ref/pgbench.sgml index c857aa3cba..e3a0abb4c7 100644 --- a/doc/src/sgml/ref/pgbench.sgml +++ b/doc/src/sgml/ref/pgbench.sgml @@ -306,6 +306,31 @@ pgbench options d + + --partitions=NUM + + +Create a partitioned pgbench_accounts table with +NUM partitions of nearly equal size for +the scaled number of accounts. +Default is 0, meaning no partitioning. + + + + + + --partition-method=NAME + + +Create a partitioned pgbench_accounts table with +NAME method. +Expected values are range or hash. +This option requires that --partitions is set to non-zero. +If unspecified, default is range. 
+ + + + --tablespace=tablespace diff --git a/src/bin/pgbench/pgbench.c b/src/bin/pgbench/pgbench.c index ed7652bfbf..10eadd8e96 100644 --- a/src/bin/pgbench/pgbench.c +++ b/src/bin/pgbench/pgbench.c @@ -186,6 +186,19 @@ int64 latency_limit = 0; char *tablespace = NULL; char *index_tablespace = NULL; +/* partitioning for pgbench_accounts table, 0 for no partitioning */ +static int partitions = 0; + +typedef enum +{ + PART_NONE, PART_RANGE, PART_HASH +} + + partition_method_t; + +static partition_method_t partition_method = PART_NONE; +static const char *PARTITION_METHOD[] = {"none", "range", "hash"}; + /* random seed used to initialize base_random_sequence */ int64 random_seed = -1; @@ -617,6 +630,9 @@ usage(void) " --foreign-keys create foreign key constraints between tables\n" " --index-tablespace=TABLESPACE\n" " create indexes in the specified tablespace\n" + " --partitions=NUM partition pgbench_accounts in NUM parts (default: 0)\n" + " --partition-method=(range|hash)\n" + " partition pgbench_accounts with this method (default: range)\n" " --tablespace=TABLESPACE create tables in the specified tablespace\n" " --unlogged-tablescreate tables as unlogged tables\n" "\nOptions to select what to run:\n" @@ -3601,6 +3617,80 @@ initDropTables(PGconn *con) "pgbench_tellers"); } +/* + * add fillfactor percent option. + */ +static void +append_fillfactor(char *opts, int len) +{ + /* as default is 100, it could be removed in this case */ + sn
Re: Efficient output for integer types
On Wed, Sep 18, 2019 at 04:27:46PM +0900, Kyotaro Horiguchi wrote: > Hello. > > At Wed, 18 Sep 2019 05:42:01 +0200, David Fetter wrote in > <20190918034201.gx31...@fetter.org> > > On Tue, Sep 17, 2019 at 09:01:57AM +0200, David Fetter wrote: > > > On Tue, Sep 17, 2019 at 08:55:05AM +0200, David Fetter wrote: > > > > On Sun, Sep 15, 2019 at 09:18:49AM +0200, David Fetter wrote: > > > > > Folks, > > > > > > > > > > Please find attached a couple of patches intended to $subject. > > > > > > > > > > This patch set cut the time to copy ten million rows of randomly sized > > > > > int8s (10 of them) by about a third, so at least for that case, it's > > > > > pretty decent. > > > > > > > > Added int4 output, removed the sprintf stuff, as it didn't seem to > > > > help in any cases I was testing. > > > > > > Found a couple of "whiles" that should have been "ifs." > > > > Factored out some inefficient functions and made the guts use the more > > efficient function. > > I'm not sure this is on the KISS principle, but looked it and > have several random comments. > > +numutils.o: CFLAGS += $(PERMIT_DECLARATION_AFTER_STATEMENT) > > I don't think that we are allowing that as project coding > policy. It seems to have been introduced only to accept external > code as-is. Changed to fit current policy. > -charstr[23];/* sign, 21 digits and '\0' */ > +charstr[MAXINT8LEN]; > > It's uneasy that MAXINT8LEN contains tailling NUL. MAXINT8BUFLEN > can be so. I think MAXINT8LEN should be 20 and the definition > should be str[MAXINT8LEN + 1]. Done. > +static const char DIGIT_TABLE[200] = { > + '0', '0', '0', '1', '0', '2', '0', '3', '0', '4', '0', '5', '0', '6', > '0', '7', '0', '8', '0', '9', > > Wouldn't it be simpler if it were defined as a constant string? > > static const char DIGIT_TABLE[201] = > "00010203040519" > "20212223242539" > .. I thought this might be even clearer: "00" "01" "02" "03" "04" "05" "06" "07" "08" "09" "10" "11" "12" "13" "14" "15" "16" "17" "18" "19" "20" "21" "22" "23" "24" "25" "26" "27" "28" "29" "30" "31" "32" "33" "34" "35" "36" "37" "38" "39" "40" "41" "42" "43" "44" "45" "46" "47" "48" "49" "50" "51" "52" "53" "54" "55" "56" "57" "58" "59" "60" "61" "62" "63" "64" "65" "66" "67" "68" "69" "70" "71" "72" "73" "74" "75" "76" "77" "78" "79" "80" "81" "82" "83" "84" "85" "86" "87" "88" "89" "90" "91" "92" "93" "94" "95" "96" "97" "98" "99"; > +pg_ltoa_n(int32 value, char *a) > ... > + /* Compute the result string. */ > + while (value >= 1) > > We have only two degits above the value. Isn't the stuff inside > the while a waste of cycles? Changed the while to an if. > + /* Expensive 64-bit division. Optimize? */ > > I believe compiler treats such trivial optimizations. (concretely > converts into shifts and subtractons if faster.) Comments removed. > + memcpy(a + olength - i - 2, DIGIT_TABLE + c0, 2); > > Maybe it'd be easy to read if 'a + olength - i' is a single variable. Done. > + i += adjust; > + return i; > > If 'a + olength - i' is replaced with a variable, the return > statement is replacable with "return olength + adjust;". I'm not sure I understand this. > + return t + (v >= PowersOfTen[t]); > > I think it's better that if it were 't - (v < POT[t]) + 1; /* > log10(v) + 1 */'. At least we need an explanation of the > difference. (I'didn't checked the algorithm is truely right, > though.) Comments added. > > void > > pg_lltoa(int64 value, char *a) > > { > .. > > memcpy(a, "-9223372036854775808", 21); > .. 
> > memcpy(a, "0", 2); > > The lines need a comment like "/* length contains trailing '\0' > */" Comments added. > + if (value >= 0) > ... > + else > + { > + if (value == PG_INT32_MIN) > + memcpy(str, "-2147483648", 11); > + return str + 11; > > } > + *str++ = '-'; > + return pg_ltostr_zeropad(str, -value, minwidth - 1); > > If then block of the if statement were (values < 0), we won't > need to reenter the functaion. This is a tail-call recursion, so it's probably optimized already. > + len = pg_ltoa_n(value, str); > + if (minwidth <= len) > + return str + len; > + > + memmove(str + minwidth - len, str, len); > > If the function had the parameters str with the room only for two > digits and a NUL, 2 as minwidth but 1000 as value, the function > would overrun the buffer. The original function just ignores > overflowing digits. I believe the original was incorrect. Best, David. -- David Fetter http://fetter.org/ Phone: +1 415 235 3778 Remember to vote! Consider donating to Postgres: http://www.postgresql.org/about/donate >From 30d8a2b1bd2f6094a5cb4dd8acb9cc36c837802a Mon Sep 17 00:00:00 2001 From: David Fetter Date: Sun, 15 Sep 2019 00:
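As a standalone illustration of the two-digits-per-iteration technique under discussion (independent of the patch; the table here is filled at runtime just to keep the example short, whereas the patch uses a compile-time DIGIT_TABLE constant):

#include <stdint.h>
#include <string.h>

static char digit_table[200];

/* Fill the 100 two-character entries "00".."99" once at startup. */
static void
init_digit_table(void)
{
    for (int i = 0; i < 100; i++)
    {
        digit_table[i * 2] = (char) ('0' + i / 10);
        digit_table[i * 2 + 1] = (char) ('0' + i % 10);
    }
}

/*
 * Write the decimal digits of v into buf (no sign, no trailing NUL),
 * emitting two digits per division; returns the number of bytes written.
 */
static int
u32_to_decimal(uint32_t v, char *buf)
{
    char        tmp[10];            /* UINT32_MAX has 10 digits */
    char       *p = tmp + sizeof(tmp);
    int         len;

    while (v >= 100)
    {
        uint32_t    q = v / 100;
        uint32_t    r = v - q * 100;

        p -= 2;
        memcpy(p, digit_table + r * 2, 2);
        v = q;
    }
    if (v >= 10)
    {
        p -= 2;
        memcpy(p, digit_table + v * 2, 2);
    }
    else
        *--p = (char) ('0' + v);

    len = (int) (tmp + sizeof(tmp) - p);
    memcpy(buf, p, len);
    return len;
}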
Re: log bind parameter values on error
Hi, thanks for looking. On 2019-Sep-20, Andres Freund wrote: > On 2019-09-18 17:58:53 -0300, Alvaro Herrera wrote: > > + > xreflabel="log_parameters_on_error"> > > + log_parameters_on_error > > (boolean) > > + > > + log_parameters_on_error configuration > > parameter > > + > > + > > + > > + > > +Controls whether the statement is logged with bind parameter > > values. > > Trailing whitespace. > > I find the "the statement" formulation a bit odd, but can't quite put my > finger on why. Yeah, I think that wording is pretty confusing. I would use "Controls whether bind parameters are logged when a statement is logged." > > +/* > > + * The top-level portal that the client is immediately working with: > > + * creating, binding, executing, or all at one using simple protocol > > + */ > > +Portal current_top_portal = NULL; > > + > > This strikes me as decidedly not nice. For one this means that we'll not > be able to use this infrastructure for binds that are done serverside, > e.g. as part of plpgsql. I'm basically inclined to think that > integrating this at the postges.c level is the wrong thing. [...] > It seems to me that this would really need to be tracked inside the > portal infrastructure. I think that's how this was done at first, then Peter E drove him away from that into the current design. > It also adds new error handling complexity, which is already quite > substantial. And spreads knowledge of portals to elog.c, which imo > shouldn't have to know about them. Makes sense. > > + appendStringInfoCharMacro(¶m_str, '\''); > > + for (p = pstring; *p; p++) > > + { > > + if (*p == '\'') /* double single quotes */ > > + appendStringInfoCharMacro(¶m_str, *p); > > + appendStringInfoCharMacro(¶m_str, *p); > > + } > > + appendStringInfoCharMacro(¶m_str, '\''); > > I know this is old code, but: This is really inefficient. Will cause a > lot of unnecessary memory-reallocations for large text outputs, because > we don't immediately grow to at least close to the required size. Agreed, but we can't blame a patch because it moves around some old crufty code. If Alexey wants to include another patch to change this to a better formulation, I'm happy to push that before or after his main patch. And if he doesn't want to, that's fine with me too. (Is doing a strlen first to enlarge the stringinfo an overall better approach?) (I wonder if it would make sense to strchr each ' and memcpy the intervening bytes ... if only that didn't break the SI abstraction completely ...) -- Álvaro Herrerahttps://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Re: Write visibility map during CLUSTER/VACUUM FULL
Hi, On 2019-09-13 22:22:50 +0300, Alexander Korotkov wrote: > On Thu, Sep 12, 2019 at 4:55 PM Alexander Korotkov > wrote: > > On Wed, Sep 11, 2019 at 3:30 PM Amit Kapila wrote: > > > On Sun, Sep 1, 2019 at 1:37 PM Alexander Korotkov > > > wrote: > > > > I found it weird that CLUSTER/VACUUM FULL don't write visibility map. > > > > Attached patch implements writing visibility map in > > > > heapam_relation_copy_for_cluster(). > > > > > > > > I've studied previous attempt to implement this [1]. The main problem > > > > of that attempt was usage of existing heap_page_is_all_visible() and > > > > visibilitymap_set() functions. These functions works through buffer > > > > manager, while heap rewriting is made bypass buffer manager. > > > > > > > > In my patch visibility map pages are handled in the same way as heap > > > > pages are. Why do you want to do that? To me that doesn't immediately seems like a good idea, and I've not seen justification for it in this thread. Did I miss something? > > > I haven't studied this patch in detail, but while glancing I observed > > > that this doesn't try to sync the vm pages as we do for heap pages in > > > the end (during end_heap_rewrite). Am I missing something? > > > > You're not missed anything. Yes, VM need sync. Will fix this. And I > > just noticed I need a closer look to what is going on with TOAST. > > Attached patch syncs VM during end_heap_rewrite(). > > However, VM for TOAST still isn't read. It appear to be much more > difficult to write VM for TOAST, because it's written by insertion > tuples one-by-one. Despite it seems to fill TOAST heap pages > sequentially (assuming no FSM exists yet), it's quite hard to handle > page-switching event with reasonable level of abstraction. > Nevertheless, I find this patch useful in current shape. Even if we > don't write VM for TOAST, it's still useful to do for regular heap. > Additionally, one of key advantages of having VM is index-only scan, > which don't work for TOAST anyway. Have you looked at the discussion around https://www.postgresql.org/message-id/20190408010427.4l63qr7h2fjcyp77%40alap3.anarazel.de ? That's not quite the same thing as you're tackling here, but I wonder if some of the logic could be the same. Especially for toast. > +/* Write contents of VM page */ > +static void > +rewrite_flush_vm_page(RewriteState state) > +{ > + Assert(state->rs_vm_buffer_valid); > + > + if (state->rs_use_wal) > + log_newpage(&state->rs_new_rel->rd_node, > + VISIBILITYMAP_FORKNUM, > + state->rs_vm_blockno, > + state->rs_vm_buffer, > + true); > + RelationOpenSmgr(state->rs_new_rel); > + > + PageSetChecksumInplace(state->rs_vm_buffer, state->rs_vm_blockno); > + > + smgrextend(state->rs_new_rel->rd_smgr, VISIBILITYMAP_FORKNUM, > +state->rs_vm_blockno, (char *) state->rs_vm_buffer, > true); > + > + state->rs_vm_buffer_valid = false; > +} > + > +/* Set VM flags to the VM page */ > +static void > +rewrite_set_vm_flags(RewriteState state) > +{ > + BlockNumber mapBlock = HEAPBLK_TO_MAPBLOCK(state->rs_blockno); > + uint32 mapByte = HEAPBLK_TO_MAPBYTE(state->rs_blockno); > + uint8 mapOffset = HEAPBLK_TO_OFFSET(state->rs_blockno); > + char *map; > + uint8 flags; > + > + if (mapBlock != state->rs_vm_blockno && state->rs_vm_buffer_valid) > + rewrite_flush_vm_page(state); > + > + if (!state->rs_vm_buffer_valid) > + { > + PageInit(state->rs_vm_buffer, BLCKSZ, 0); > + state->rs_vm_blockno = mapBlock; > + state->rs_vm_buffer_valid = true; > + } > + > + flags = (state->rs_all_visible ? 
VISIBILITYMAP_ALL_VISIBLE : 0) | > + (state->rs_all_frozen ? VISIBILITYMAP_ALL_FROZEN : 0); > + > + map = PageGetContents(state->rs_vm_buffer); > + map[mapByte] |= (flags << mapOffset); > +} I think it's a bad idea to copy this many VM implementation details into rewriteheap.c. > +/* > + * Update rs_all_visible and rs_all_frozen flags according to the tuple. We > + * use simplified check assuming that HeapTupleSatisfiesVacuum() should > already > + * set tuple hint bits. > + */ > +static void > +rewrite_update_vm_flags(RewriteState state, HeapTuple tuple) > +{ > + TransactionId xmin; > + > + if (!state->rs_all_visible) > + return; > + > + if (!HeapTupleHeaderXminCommitted(tuple->t_data)) > + { > + state->rs_all_visible = false; > + state->rs_all_frozen = false; > + return; > + } > + > + xmin = HeapTupleHeaderGetXmin(tuple->t_data); > + if (!TransactionIdPrecedes(xmin, state->rs_oldest_xmin)) > + { > + state->rs_all_visible = false; > + state->rs_all_frozen = false; > +
Re: Unwanted expression simplification in PG12b2
On Wed, Jul 17, 2019 at 5:20 PM Darafei "Komяpa" Praliaskouski wrote: > Indeed, it seems I failed to minimize my example. > > Here is the actual one, on 90GB table with 16M rows: > https://gist.github.com/Komzpa/8d5b9008ad60f9ccc62423c256e78b4c > > I can share the table on request if needed, but hope that plan may be enough. [ replying to an old thread ] I think that this boils down to a lack of planner smarts about target lists. The planner currently assumes that any given relation - which for planner purposes might be an actual table or might be the result of joining multiple tables, aggregating something, running a subquery, etc. - more or less has one thing that it's supposed to produce. It only tries to generate plans that produce that target list. There's some support in there for the idea that there might be various paths for the same relation that produce different answers, but I don't know of that actually being used anywhere (but it might be). What I taught the planner to do here had to do with making the costing more accurate for cases like this. It now figures out that if it's going to stick a Gather in at that point, computing the expressions below the Gather rather than above the Gather makes a difference to the cost of that plan vs. other plans. However, it still doesn't consider any more paths than it did before; it just costs them more accurately. In your first example, I believe that the planner should be able to consider both GroupAggregate -> Gather Merge -> Sort -> Parallel Seq Scan and GroupAggregate -> Sort -> Gather -> Parallel Seq Scan, but I think it's got a fixed idea about which fields should be fed into the Sort. In particular, I believe it thinks that sorting more data is so undesirable that it doesn't want to carry any unnecessary baggage through the Sort for any reason. To solve this problem, I think it would need to cost the second plan with projection done both before the Sort and after the Sort and decide which one was cheaper. This class of problem is somewhat annoying in that the extra planner cycles and complexity to deal with getting this right would be useless for many queries, but at the same time, there are a few cases where it can win big. I don't know what to do about that. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: subscriptionCheck failures on nightjar
Andres Freund writes: > Since now a number of people (I tried as well), failed to reproduce this > locally, I propose that we increase the log-level during this test on > master. And perhaps expand the set of debugging information. With the > hope that the additional information on the cases encountered on the bf > helps us build a reproducer or, even better, diagnose the issue > directly. If people agree, I'll come up with a patch. I recreated my freebsd-9-under-qemu setup and I can still reproduce the problem, though not with high reliability (order of 1 time in 10). Anything particular you want logged? regards, tom lane
Re: log bind parameter values on error
Hi, On 2019-09-20 16:56:57 -0300, Alvaro Herrera wrote: > > > +/* > > > + * The top-level portal that the client is immediately working with: > > > + * creating, binding, executing, or all at one using simple protocol > > > + */ > > > +Portal current_top_portal = NULL; > > > + > > > > This strikes me as decidedly not nice. For one this means that we'll not > > be able to use this infrastructure for binds that are done serverside, > > e.g. as part of plpgsql. I'm basically inclined to think that > > integrating this at the postges.c level is the wrong thing. > [...] > > It seems to me that this would really need to be tracked inside the > > portal infrastructure. > > I think that's how this was done at first, then Peter E drove him away > from that into the current design. I don't think it really was the way I am suggesting. There were a bunch of helper functions managing current_top_portal, but it otherwise is (and afaict was in all versions) still all postgres.c controlled. Whereas I think it should be nearly exclusively be handled by pquery.c, with the possible exception of an efficiency hack to reuse client input string when they're in text format. What I'm suggesting is that PortalStart() would build a string representation out of the parameter list (using ParamExternData.textValue if set, calling the output function otherwise), and stash that in the portal. At the start of all the already existing PG_TRY blocks in pquery.c we push an element onto the error_context_stack that adds the errdetail with the parameters to the error. Alternatively we could also add them in in the PG_CATCH blocks, but that'd only work for elevel == ERROR (i.e. neither FATAL nor non throwing log levels would be able to enrich the error). Thinking about this: I think the current approach doesn't actually handle recursive errors correctly. Even if we fail to emit the error message due to the parameter details requiring a lot of memory, we'd again do so (and again fail) when handling that OOM error, because current_top_portal afaict would still be set. At the very least this'd need to integrate with the recursion_depth logic in elog.c. > > > + appendStringInfoCharMacro(¶m_str, '\''); > > > + for (p = pstring; *p; p++) > > > + { > > > + if (*p == '\'') /* double single quotes */ > > > + appendStringInfoCharMacro(¶m_str, *p); > > > + appendStringInfoCharMacro(¶m_str, *p); > > > + } > > > + appendStringInfoCharMacro(¶m_str, '\''); > > > > I know this is old code, but: This is really inefficient. Will cause a > > lot of unnecessary memory-reallocations for large text outputs, because > > we don't immediately grow to at least close to the required size. > > Agreed, but we can't blame a patch because it moves around some old > crufty code. If Alexey wants to include another patch to change this to > a better formulation, I'm happy to push that before or after his main > patch. And if he doesn't want to, that's fine with me too. Well, this patch makes it potentially a considerably hotter path, so I think there's some justification for pushing a bit. But I'd not require it either. As I said, I think we cannot generate this string at error time, because it makes it much much more likely to exhaust the error context - a bad thing. > (Is doing a strlen first to enlarge the stringinfo an overall better > approach?) Yes, it'd be better. > (I wonder if it would make sense to strchr each ' and memcpy the > intervening bytes ... if only that didn't break the SI abstraction > completely ...) 
I'd probably just count the ' in one pass, enlarge the stringinfo to the required size, and then put the string directly into the string buffer. Possibly just putting the necessary code into stringinfo.c. We already have multiple copies of this inefficient logic... But even if not, I don't think writing data into the stringbuf directly is that ugly, we do that in enough places that I'd argue that that's basically part of how it's expected to be used. In HEAD there's at least - postgres.c:errdetail_params(), - pl_exec.c:format_preparedparamsdata() pl_exec.c:format_expr_params() and - dblink.c:escape_param_str() - deparse.c:deparseStringLiteral() - xlog.c:do_pg_start_backup() (after "Add the escape character"), - tsearchcmds.c:serialize_deflist() - ruleutils.c:simple_quote_literal() are nearly the same. Greetings, Andres Freund
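A minimal sketch of that one-pass scheme (appendStringInfoQuoted is a hypothetical helper, not an existing stringinfo.c function):

#include "postgres.h"
#include "lib/stringinfo.h"

/*
 * Hypothetical helper: append "str" as a single-quoted literal, doubling
 * embedded quotes, with a single enlargement instead of per-character
 * appends.
 */
static void
appendStringInfoQuoted(StringInfo buf, const char *str)
{
    int         nquotes = 0;
    const char *p;
    char       *dst;

    for (p = str; *p; p++)
        if (*p == '\'')
            nquotes++;

    /* two surrounding quotes, doubled quotes, and the payload itself */
    enlargeStringInfo(buf, (int) strlen(str) + nquotes + 2);

    dst = buf->data + buf->len;
    *dst++ = '\'';
    for (p = str; *p; p++)
    {
        if (*p == '\'')
            *dst++ = '\'';      /* write the doubled quote */
        *dst++ = *p;
    }
    *dst++ = '\'';
    *dst = '\0';
    buf->len = dst - buf->data;
}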
Wrong results using initcap() with non normalized string
Hello, I have come around a strange situation when using a unicode string that has non normalized characters. The attached script 'initcap.sql' can reproduce the problem. The attached patch can fix the issue. Regards, Juan José Santamaría Flecha initcap.sql Description: application/sql diff --git a/src/backend/utils/adt/formatting.c b/src/backend/utils/adt/formatting.c index 755ca6e..9f8becf 100644 --- a/src/backend/utils/adt/formatting.c +++ b/src/backend/utils/adt/formatting.c @@ -96,6 +96,7 @@ #include "utils/memutils.h" #include "utils/numeric.h" #include "utils/pg_locale.h" +#include "common/unicode_norm.h" /* -- * Routines type @@ -1864,7 +1865,8 @@ str_initcap(const char *buff, size_t nbytes, Oid collid) workspace[curr_char] = towlower_l(workspace[curr_char], mylocale->info.lt); else workspace[curr_char] = towupper_l(workspace[curr_char], mylocale->info.lt); - wasalnum = iswalnum_l(workspace[curr_char], mylocale->info.lt); + if (!is_pg_wchar_combining(workspace[curr_char])) + wasalnum = iswalnum_l(workspace[curr_char], mylocale->info.lt); } else #endif @@ -1873,7 +1875,8 @@ str_initcap(const char *buff, size_t nbytes, Oid collid) workspace[curr_char] = towlower(workspace[curr_char]); else workspace[curr_char] = towupper(workspace[curr_char]); - wasalnum = iswalnum(workspace[curr_char]); + if (!is_pg_wchar_combining(workspace[curr_char])) + wasalnum = iswalnum(workspace[curr_char]); } } diff --git a/src/common/unicode_norm.c b/src/common/unicode_norm.c index 89c5533..25b149b 100644 --- a/src/common/unicode_norm.c +++ b/src/common/unicode_norm.c @@ -435,3 +435,14 @@ unicode_normalize_kc(const pg_wchar *input) return recomp_chars; } + +bool +is_pg_wchar_combining(const pg_wchar current) +{ + pg_unicode_decomposition *currEntry = get_code_entry(current); + if (currEntry == NULL) + return false; + if (currEntry->comb_class == 0x0) + return false; + return true; +} diff --git a/src/include/common/unicode_norm.h b/src/include/common/unicode_norm.h index 99167d2..bdcf02e 100644 --- a/src/include/common/unicode_norm.h +++ b/src/include/common/unicode_norm.h @@ -17,5 +17,6 @@ #include "mb/pg_wchar.h" extern pg_wchar *unicode_normalize_kc(const pg_wchar *input); +extern bool is_pg_wchar_combining(const pg_wchar current); #endif /* UNICODE_NORM_H */
Re: Efficient output for integer types
On Fri, Sep 20, 2019 at 09:14:51PM +0200, David Fetter wrote: > On Wed, Sep 18, 2019 at 04:27:46PM +0900, Kyotaro Horiguchi wrote: > > Hello. > > > > At Wed, 18 Sep 2019 05:42:01 +0200, David Fetter wrote > > in <20190918034201.gx31...@fetter.org> > > > On Tue, Sep 17, 2019 at 09:01:57AM +0200, David Fetter wrote: > > > > On Tue, Sep 17, 2019 at 08:55:05AM +0200, David Fetter wrote: > > > > > On Sun, Sep 15, 2019 at 09:18:49AM +0200, David Fetter wrote: > > > > > > Folks, > > > > > > > > > > > > Please find attached a couple of patches intended to $subject. > > > > > > > > > > > > This patch set cut the time to copy ten million rows of randomly > > > > > > sized > > > > > > int8s (10 of them) by about a third, so at least for that case, it's > > > > > > pretty decent. > > > > > > > > > > Added int4 output, removed the sprintf stuff, as it didn't seem to > > > > > help in any cases I was testing. > > > > > > > > Found a couple of "whiles" that should have been "ifs." > > > > > > Factored out some inefficient functions and made the guts use the more > > > efficient function. > > > > I'm not sure this is on the KISS principle, but looked it and > > have several random comments. > > > > +numutils.o: CFLAGS += $(PERMIT_DECLARATION_AFTER_STATEMENT) Oops. Missed a few. Best, David. -- David Fetter http://fetter.org/ Phone: +1 415 235 3778 Remember to vote! Consider donating to Postgres: http://www.postgresql.org/about/donate >From f7f9cb42ce9d50e5a4746fe80208960e29ffd348 Mon Sep 17 00:00:00 2001 From: David Fetter Date: Sun, 15 Sep 2019 00:06:29 -0700 Subject: [PATCH v8] Make int4 and int8 operations more efficent To: hackers MIME-Version: 1.0 Content-Type: multipart/mixed; boundary="2.21.0" This is a multi-part message in MIME format. --2.21.0 Content-Type: text/plain; charset=UTF-8; format=fixed Content-Transfer-Encoding: 8bit - Output routines now do more digits per iteration, and - Code determines the number of decimal digits in int4/int8 efficiently - Split off pg_ltoa_n from pg_ltoa - Use same to make other functions shorter diff --git a/src/backend/access/common/printsimple.c b/src/backend/access/common/printsimple.c index 651ade14dd..5c5b6d33b2 100644 --- a/src/backend/access/common/printsimple.c +++ b/src/backend/access/common/printsimple.c @@ -112,7 +112,7 @@ printsimple(TupleTableSlot *slot, DestReceiver *self) case INT8OID: { int64 num = DatumGetInt64(value); - char str[23]; /* sign, 21 digits and '\0' */ + char str[MAXINT8LEN + 1]; pg_lltoa(num, str); pq_sendcountedtext(&buf, str, strlen(str), false); diff --git a/src/backend/utils/adt/int8.c b/src/backend/utils/adt/int8.c index 0ff9394a2f..6230807906 100644 --- a/src/backend/utils/adt/int8.c +++ b/src/backend/utils/adt/int8.c @@ -27,8 +27,6 @@ #include "utils/builtins.h" -#define MAXINT8LEN 25 - typedef struct { int64 current; diff --git a/src/backend/utils/adt/numutils.c b/src/backend/utils/adt/numutils.c index 70138feb29..fef6da672b 100644 --- a/src/backend/utils/adt/numutils.c +++ b/src/backend/utils/adt/numutils.c @@ -20,6 +20,64 @@ #include "common/int.h" #include "utils/builtins.h" +#include "port/pg_bitutils.h" + +/* + * A table of all two-digit numbers. This is used to speed up decimal digit + * generation by copying pairs of digits into the final output. 
+ */ +static const char DIGIT_TABLE[200] = +"00" "01" "02" "03" "04" "05" "06" "07" "08" "09" +"10" "11" "12" "13" "14" "15" "16" "17" "18" "19" +"20" "21" "22" "23" "24" "25" "26" "27" "28" "29" +"30" "31" "32" "33" "34" "35" "36" "37" "38" "39" +"40" "41" "42" "43" "44" "45" "46" "47" "48" "49" +"50" "51" "52" "53" "54" "55" "56" "57" "58" "59" +"60" "61" "62" "63" "64" "65" "66" "67" "68" "69" +"70" "71" "72" "73" "74" "75" "76" "77" "78" "79" +"80" "81" "82" "83" "84" "85" "86" "87" "88" "89" +"90" "91" "92" "93" "94" "95" "96" "97" "98" "99"; + +/* + * Adapted from http://graphics.stanford.edu/~seander/bithacks.html#IntegerLog10 + */ +static inline uint32 +decimalLength32(const uint32 v) +{ + uint32 t; + static uint64 PowersOfTen[] = + {1,10,100, + 1000, 1, 10, + 100, 1000, 1, + 10}; + /* + * Compute base-10 logarithm by dividing the base-2 logarithm + * by a good-enough approximation of the base-2 logarithm of 10 + */ + t = (pg_leftmost_one_pos32(v) + 1)*1233/4096; + return t + (v >= PowersOfTen[t]); +} + +static inline uint32 +decimalLength64(const uint64 v) +{ + uint32 t; + static uint64 PowersOfTen[] = + {1,10,100, + 1000, 1, 10, + 100, 1000, 1, + 10, 100, 1000, + 1,10,100, + 1000, 1, 10, + 100}; + + /* + * Compute base-10 logar
Re: Efficient output for integer types
On Fri, Sep 20, 2019 at 11:09:16PM +0200, David Fetter wrote: > On Fri, Sep 20, 2019 at 09:14:51PM +0200, David Fetter wrote: > > On Wed, Sep 18, 2019 at 04:27:46PM +0900, Kyotaro Horiguchi wrote: > > > Hello. > > > > > > At Wed, 18 Sep 2019 05:42:01 +0200, David Fetter wrote > > > in <20190918034201.gx31...@fetter.org> > > > > On Tue, Sep 17, 2019 at 09:01:57AM +0200, David Fetter wrote: > > > > > On Tue, Sep 17, 2019 at 08:55:05AM +0200, David Fetter wrote: > > > > > > On Sun, Sep 15, 2019 at 09:18:49AM +0200, David Fetter wrote: > > > > > > > Folks, > > > > > > > > > > > > > > Please find attached a couple of patches intended to $subject. > > > > > > > > > > > > > > This patch set cut the time to copy ten million rows of randomly > > > > > > > sized > > > > > > > int8s (10 of them) by about a third, so at least for that case, > > > > > > > it's > > > > > > > pretty decent. > > > > > > > > > > > > Added int4 output, removed the sprintf stuff, as it didn't seem to > > > > > > help in any cases I was testing. > > > > > > > > > > Found a couple of "whiles" that should have been "ifs." > > > > > > > > Factored out some inefficient functions and made the guts use the more > > > > efficient function. > > > > > > I'm not sure this is on the KISS principle, but looked it and > > > have several random comments. > > > > > > +numutils.o: CFLAGS += $(PERMIT_DECLARATION_AFTER_STATEMENT) > > Oops. Missed a few. D'oh! Wrong patch. Best, David. -- David Fetter http://fetter.org/ Phone: +1 415 235 3778 Remember to vote! Consider donating to Postgres: http://www.postgresql.org/about/donate >From 51d793143daf44cc0f01a00d6aedeaf3fdd88230 Mon Sep 17 00:00:00 2001 From: David Fetter Date: Sun, 15 Sep 2019 00:06:29 -0700 Subject: [PATCH v9] Make int4 and int8 operations more efficent To: hackers MIME-Version: 1.0 Content-Type: multipart/mixed; boundary="2.21.0" This is a multi-part message in MIME format. --2.21.0 Content-Type: text/plain; charset=UTF-8; format=fixed Content-Transfer-Encoding: 8bit - Output routines now do more digits per iteration, and - Code determines the number of decimal digits in int4/int8 efficiently - Split off pg_ltoa_n from pg_ltoa - Use same to make other functions shorter diff --git a/src/backend/access/common/printsimple.c b/src/backend/access/common/printsimple.c index 651ade14dd..5c5b6d33b2 100644 --- a/src/backend/access/common/printsimple.c +++ b/src/backend/access/common/printsimple.c @@ -112,7 +112,7 @@ printsimple(TupleTableSlot *slot, DestReceiver *self) case INT8OID: { int64 num = DatumGetInt64(value); - char str[23]; /* sign, 21 digits and '\0' */ + char str[MAXINT8LEN + 1]; pg_lltoa(num, str); pq_sendcountedtext(&buf, str, strlen(str), false); diff --git a/src/backend/utils/adt/int8.c b/src/backend/utils/adt/int8.c index 0ff9394a2f..6230807906 100644 --- a/src/backend/utils/adt/int8.c +++ b/src/backend/utils/adt/int8.c @@ -27,8 +27,6 @@ #include "utils/builtins.h" -#define MAXINT8LEN 25 - typedef struct { int64 current; diff --git a/src/backend/utils/adt/numutils.c b/src/backend/utils/adt/numutils.c index 70138feb29..630aafd168 100644 --- a/src/backend/utils/adt/numutils.c +++ b/src/backend/utils/adt/numutils.c @@ -20,6 +20,64 @@ #include "common/int.h" #include "utils/builtins.h" +#include "port/pg_bitutils.h" + +/* + * A table of all two-digit numbers. This is used to speed up decimal digit + * generation by copying pairs of digits into the final output. 
+ */ +static const char DIGIT_TABLE[200] = +"00" "01" "02" "03" "04" "05" "06" "07" "08" "09" +"10" "11" "12" "13" "14" "15" "16" "17" "18" "19" +"20" "21" "22" "23" "24" "25" "26" "27" "28" "29" +"30" "31" "32" "33" "34" "35" "36" "37" "38" "39" +"40" "41" "42" "43" "44" "45" "46" "47" "48" "49" +"50" "51" "52" "53" "54" "55" "56" "57" "58" "59" +"60" "61" "62" "63" "64" "65" "66" "67" "68" "69" +"70" "71" "72" "73" "74" "75" "76" "77" "78" "79" +"80" "81" "82" "83" "84" "85" "86" "87" "88" "89" +"90" "91" "92" "93" "94" "95" "96" "97" "98" "99"; + +/* + * Adapted from http://graphics.stanford.edu/~seander/bithacks.html#IntegerLog10 + */ +static inline uint32 +decimalLength32(const uint32 v) +{ + uint32 t; + static uint64 PowersOfTen[] = + {1,10,100, + 1000, 1, 10, + 100, 1000, 1, + 10}; + /* + * Compute base-10 logarithm by dividing the base-2 logarithm + * by a good-enough approximation of the base-2 logarithm of 10 + */ + t = (pg_leftmost_one_pos32(v) + 1)*1233/4096; + return t + (v >= PowersOfTen[t]); +} + +static inline uint32 +decimalLength64(const uint64 v) +{ + uint32 t; + static uint64 PowersOfTen[] = + {1,10,100, + 1000, 1, 10, + 100, 1000, 1, + 10, 100, 1000, + 10
Re: subscriptionCheck failures on nightjar
Hi, On 2019-09-20 16:25:21 -0400, Tom Lane wrote: > Andres Freund writes: > > Since now a number of people (I tried as well), failed to reproduce this > > locally, I propose that we increase the log-level during this test on > > master. And perhaps expand the set of debugging information. With the > > hope that the additional information on the cases encountered on the bf > > helps us build a reproducer or, even better, diagnose the issue > > directly. If people agree, I'll come up with a patch. > > I recreated my freebsd-9-under-qemu setup and I can still reproduce > the problem, though not with high reliability (order of 1 time in 10). > Anything particular you want logged? A DEBUG2 log would help a fair bit, because it'd log some information about what changes the "horizons" determining when data may be removed. Perhaps with the additional elogs attached? I lowered some messages to DEBUG2 so we don't have to suffer the noise of the ipc.c DEBUG3 messages. If I use a TEMP_CONFIG setting log_min_messages=DEBUG2 with the patches applied, the subscription tests still pass. I hope they still fail on your setup, even though the increased logging volume probably changes timing somewhat. Greetings, Andres Freund diff --git i/src/backend/access/transam/xlog.c w/src/backend/access/transam/xlog.c index b7ff004234a..0d48ef3f039 100644 --- i/src/backend/access/transam/xlog.c +++ w/src/backend/access/transam/xlog.c @@ -2676,9 +2676,16 @@ XLogSetAsyncXactLSN(XLogRecPtr asyncXactLSN) void XLogSetReplicationSlotMinimumLSN(XLogRecPtr lsn) { + XLogRecPtr old_lsn; + SpinLockAcquire(&XLogCtl->info_lck); + old_lsn = XLogCtl->replicationSlotMinLSN; XLogCtl->replicationSlotMinLSN = lsn; SpinLockRelease(&XLogCtl->info_lck); + + elog(DEBUG2, "updating slot minimum lsn from %X/%X to %X/%X", + (uint32) (old_lsn >> 32), (uint32) old_lsn, + (uint32) (lsn >> 32), (uint32) lsn); } diff --git i/src/backend/replication/logical/snapbuild.c w/src/backend/replication/logical/snapbuild.c index 0bd1d0f9545..f7d270340de 100644 --- i/src/backend/replication/logical/snapbuild.c +++ w/src/backend/replication/logical/snapbuild.c @@ -913,7 +913,7 @@ SnapBuildPurgeCommittedTxn(SnapBuild *builder) memcpy(builder->committed.xip, workspace, surviving_xids * sizeof(TransactionId)); - elog(DEBUG3, "purged committed transactions from %u to %u, xmin: %u, xmax: %u", + elog(DEBUG2, "purged committed transactions from %u to %u, xmin: %u, xmax: %u", (uint32) builder->committed.xcnt, (uint32) surviving_xids, builder->xmin, builder->xmax); builder->committed.xcnt = surviving_xids; @@ -1140,7 +1140,7 @@ SnapBuildProcessRunningXacts(SnapBuild *builder, XLogRecPtr lsn, xl_running_xact xmin = ReorderBufferGetOldestXmin(builder->reorder); if (xmin == InvalidTransactionId) xmin = running->oldestRunningXid; - elog(DEBUG3, "xmin: %u, xmax: %u, oldest running: %u, oldest xmin: %u", + elog(DEBUG2, "xmin: %u, xmax: %u, oldest running: %u, oldest xmin: %u", builder->xmin, builder->xmax, running->oldestRunningXid, xmin); LogicalIncreaseXminForSlot(lsn, xmin); @@ -1952,6 +1952,10 @@ CheckPointSnapBuild(void) /* now check for the restart ptrs from existing slots */ cutoff = ReplicationSlotsComputeLogicalRestartLSN(); + elog(DEBUG2, "doing snapbuild checkpoint with restart lsn %X/%X, redo %X/%X", + (uint32) (cutoff >> 32), (uint32) cutoff, + (uint32) (redo >> 32), (uint32) redo); + /* don't start earlier than the restart lsn */ if (redo < cutoff) cutoff = redo; diff --git i/src/backend/storage/ipc/procarray.c w/src/backend/storage/ipc/procarray.c index 
8abcfdf841f..9e8a93e5962 100644 --- i/src/backend/storage/ipc/procarray.c +++ w/src/backend/storage/ipc/procarray.c @@ -2981,16 +2981,24 @@ void ProcArraySetReplicationSlotXmin(TransactionId xmin, TransactionId catalog_xmin, bool already_locked) { + TransactionId old_xmin; + TransactionId old_catalog_xmin; + Assert(!already_locked || LWLockHeldByMe(ProcArrayLock)); if (!already_locked) LWLockAcquire(ProcArrayLock, LW_EXCLUSIVE); + old_xmin = procArray->replication_slot_xmin; + old_catalog_xmin = procArray->replication_slot_catalog_xmin; procArray->replication_slot_xmin = xmin; procArray->replication_slot_catalog_xmin = catalog_xmin; if (!already_locked) LWLockRelease(ProcArrayLock); + + elog(DEBUG2, "updating slot xmin/catalog_xmin from %u/%u to %u/%u", + old_xmin, old_catalog_xmin, xmin, catalog_xmin); } /*
Re: subscriptionCheck failures on nightjar
Andres Freund writes: > On 2019-09-20 16:25:21 -0400, Tom Lane wrote: >> I recreated my freebsd-9-under-qemu setup and I can still reproduce >> the problem, though not with high reliability (order of 1 time in 10). >> Anything particular you want logged? > A DEBUG2 log would help a fair bit, because it'd log some information > about what changes the "horizons" determining when data may be removed. Actually, what I did was as attached [1], and I am getting traces like [2]. The problem seems to occur only when there are two or three processes concurrently creating the same snapshot file. It's not obvious from the debug trace, but the snapshot file *does* exist after the music stops. It is very hard to look at this trace and conclude anything other than "rename(2) is broken, it's not atomic". Nothing in our code has deleted the file: no checkpoint has started, nor do we see the DEBUG1 output that CheckPointSnapBuild ought to produce. But fsync_fname momentarily can't see it (and then later another process does see it). It is now apparent why we're only seeing this on specific ancient platforms. I looked around for info about rename(2) not being atomic, and I found this info about FreeBSD: https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=94849 The reported symptom there isn't quite the same, so probably there is another issue, but there is plenty of reason to be suspicious that UFS rename(2) is buggy in this release. As for dromedary's ancient version of macOS, Apple is exceedingly untransparent about their bugs, but I found http://www.weirdnet.nl/apple/rename.html In short, what we got here is OS bugs that have probably been resolved years ago. The question is what to do next. Should we just retire these specific buildfarm critters, or do we want to push ahead with getting rid of the PANIC here? regards, tom lane
Re: subscriptionCheck failures on nightjar
Sigh, forgot about attaching the attachments ... regards, tom lane diff --git a/src/backend/replication/logical/snapbuild.c b/src/backend/replication/logical/snapbuild.c index 0bd1d0f..53fd33c 100644 --- a/src/backend/replication/logical/snapbuild.c +++ b/src/backend/replication/logical/snapbuild.c @@ -1670,11 +1670,14 @@ SnapBuildSerialize(SnapBuild *builder, XLogRecPtr lsn) errmsg("could not rename file \"%s\" to \"%s\": %m", tmppath, path))); } + elog(DEBUG1, "renamed serialized snapshot %s to %s", tmppath, path); /* make sure we persist */ fsync_fname(path, false); fsync_fname("pg_logical/snapshots", true); + elog(DEBUG1, "fsynced %s", path); + /* * Now there's no way we can loose the dumped state anymore, remember this * as a serialization point. diff --git a/src/test/subscription/t/010_truncate.pl b/src/test/subscription/t/010_truncate.pl index be2c0bd..2986582 100644 --- a/src/test/subscription/t/010_truncate.pl +++ b/src/test/subscription/t/010_truncate.pl @@ -9,6 +9,11 @@ use Test::More tests => 9; my $node_publisher = get_new_node('publisher'); $node_publisher->init(allows_streaming => 'logical'); +$node_publisher->append_conf('postgresql.conf', q{ +log_checkpoints = on +log_min_messages = 'debug1' +log_error_verbosity = verbose +}); $node_publisher->start; my $node_subscriber = get_new_node('subscriber'); 2019-09-20 17:08:54.361 EDT [34418] DEBUG: 0: registering background worker "logical replication launcher" 2019-09-20 17:08:54.361 EDT [34418] LOCATION: RegisterBackgroundWorker, bgworker.c:855 2019-09-20 17:08:54.362 EDT [34418] LOG: 0: starting PostgreSQL 13devel on x86_64-unknown-freebsd9.0, compiled by gcc (GCC) 4.2.1 20070831 patched [FreeBSD], 64-bit 2019-09-20 17:08:54.362 EDT [34418] LOCATION: PostmasterMain, postmaster.c:1104 2019-09-20 17:08:54.362 EDT [34418] LOG: 0: listening on Unix socket "/tmp/2lehMLoBNn/.s.PGSQL.65366" 2019-09-20 17:08:54.362 EDT [34418] LOCATION: StreamServerPort, pqcomm.c:587 2019-09-20 17:08:54.363 EDT [34419] LOG: 0: database system was shut down at 2019-09-20 17:08:54 EDT 2019-09-20 17:08:54.363 EDT [34419] LOCATION: StartupXLOG, xlog.c:6241 2019-09-20 17:08:54.363 EDT [34419] DEBUG: 0: checkpoint record is at 0/15D07F0 2019-09-20 17:08:54.363 EDT [34419] LOCATION: StartupXLOG, xlog.c:6531 2019-09-20 17:08:54.363 EDT [34419] DEBUG: 0: redo record is at 0/15D07F0; shutdown true 2019-09-20 17:08:54.363 EDT [34419] LOCATION: StartupXLOG, xlog.c:6609 2019-09-20 17:08:54.363 EDT [34419] DEBUG: 0: next transaction ID: 490; next OID: 12674 2019-09-20 17:08:54.363 EDT [34419] LOCATION: StartupXLOG, xlog.c:6613 2019-09-20 17:08:54.363 EDT [34419] DEBUG: 0: next MultiXactId: 1; next MultiXactOffset: 0 2019-09-20 17:08:54.363 EDT [34419] LOCATION: StartupXLOG, xlog.c:6616 2019-09-20 17:08:54.363 EDT [34419] DEBUG: 0: oldest unfrozen transaction ID: 483, in database 1 2019-09-20 17:08:54.363 EDT [34419] LOCATION: StartupXLOG, xlog.c:6619 2019-09-20 17:08:54.363 EDT [34419] DEBUG: 0: oldest MultiXactId: 1, in database 1 2019-09-20 17:08:54.363 EDT [34419] LOCATION: StartupXLOG, xlog.c:6622 2019-09-20 17:08:54.363 EDT [34419] DEBUG: 0: commit timestamp Xid oldest/newest: 0/0 2019-09-20 17:08:54.363 EDT [34419] LOCATION: StartupXLOG, xlog.c:6626 2019-09-20 17:08:54.363 EDT [34419] DEBUG: 0: transaction ID wrap limit is 2147484130, limited by database with OID 1 2019-09-20 17:08:54.363 EDT [34419] LOCATION: SetTransactionIdLimit, varsup.c:410 2019-09-20 17:08:54.363 EDT [34419] DEBUG: 0: MultiXactId wrap limit is 2147483648, limited by database with OID 1 
2019-09-20 17:08:54.363 EDT [34419] LOCATION: SetMultiXactIdLimit, multixact.c:2271 2019-09-20 17:08:54.363 EDT [34419] DEBUG: 0: starting up replication slots 2019-09-20 17:08:54.363 EDT [34419] LOCATION: StartupReplicationSlots, slot.c:1114 2019-09-20 17:08:54.363 EDT [34419] DEBUG: 0: MultiXactId wrap limit is 2147483648, limited by database with OID 1 2019-09-20 17:08:54.363 EDT [34419] LOCATION: SetMultiXactIdLimit, multixact.c:2271 2019-09-20 17:08:54.363 EDT [34419] DEBUG: 0: MultiXact member stop limit is now 4294914944 based on MultiXact 1 2019-09-20 17:08:54.363 EDT [34419] LOCATION: SetOffsetVacuumLimit, multixact.c:2634 2019-09-20 17:08:54.364 EDT [34418] DEBUG: 0: starting background worker process "logical replication launcher" 2019-09-20 17:08:54.364 EDT [34418] LOCATION: do_start_bgworker, postmaster.c:5749 2019-09-20 17:08:54.364 EDT [34418] LOG: 0: database system is ready to accept connections 2019-09-20 17:08:54.364 EDT [34418] LOCATION: reaper, postmaster.c:3017 2019-09-20 17:08:54.365 EDT [34423] DEBUG: 0: autovacuum launcher started 2019-09-20 17:08:54.365 EDT [34423] LOCATION: AutoVacLauncherMain, autovacuum.c:441 2019-09-20 17:08:54.365 EDT [34425] DEBUG: 0: logical replication launcher s
Re: subscriptionCheck failures on nightjar
Hi, On 2019-09-20 17:49:27 -0400, Tom Lane wrote: > Andres Freund writes: > > On 2019-09-20 16:25:21 -0400, Tom Lane wrote: > >> I recreated my freebsd-9-under-qemu setup and I can still reproduce > >> the problem, though not with high reliability (order of 1 time in 10). > >> Anything particular you want logged? > > > A DEBUG2 log would help a fair bit, because it'd log some information > > about what changes the "horizons" determining when data may be removed. > > Actually, what I did was as attached [1], and I am getting traces like > [2]. The problem seems to occur only when there are two or three > processes concurrently creating the same snapshot file. It's not > obvious from the debug trace, but the snapshot file *does* exist > after the music stops. > > It is very hard to look at this trace and conclude anything other > than "rename(2) is broken, it's not atomic". Nothing in our code > has deleted the file: no checkpoint has started, nor do we see > the DEBUG1 output that CheckPointSnapBuild ought to produce. > But fsync_fname momentarily can't see it (and then later another > process does see it). Yikes. No wonder most of us weren't able to reproduce the problem, and that staring at our code didn't point to a bug. Nice catch. > In short, what we got here is OS bugs that have probably been > resolved years ago. > > The question is what to do next. Should we just retire these > specific buildfarm critters, or do we want to push ahead with > getting rid of the PANIC here? Hm. Given that the fsync failing is actually an issue, I'm somewhat disinclined to remove the PANIC. It's not like only raising an ERROR actually solves anything, except making the problem even harder to diagnose? Or that we otherwise are OK with renames not being atomic? So I'd be tentatively in favor of either upgrading, replacing the filesystem (perhaps ZFS isn't buggy in the same way?), or retiring those animals. Greetings, Andres Freund
Re: subscriptionCheck failures on nightjar
On 2019-Sep-20, Tom Lane wrote: > Actually, what I did was as attached [1], and I am getting traces like > [2]. The problem seems to occur only when there are two or three > processes concurrently creating the same snapshot file. It's not > obvious from the debug trace, but the snapshot file *does* exist > after the music stops. Uh .. I didn't think it was possible that we would build the same snapshot file more than once. Isn't that a waste of time anyway? Maybe we can fix the symptom by just not doing that in the first place? I don't have a strategy to do that, but seems worth considering before retiring the bf animals. -- Álvaro Herrerahttps://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Re: subscriptionCheck failures on nightjar
Hi, On September 20, 2019 3:06:20 PM PDT, Alvaro Herrera wrote: >On 2019-Sep-20, Tom Lane wrote: > >> Actually, what I did was as attached [1], and I am getting traces like >> [2]. The problem seems to occur only when there are two or three >> processes concurrently creating the same snapshot file. It's not >> obvious from the debug trace, but the snapshot file *does* exist >> after the music stops. > >Uh .. I didn't think it was possible that we would build the same >snapshot file more than once. Isn't that a waste of time anyway? >Maybe we can fix the symptom by just not doing that in the first place? >I don't have a strategy to do that, but seems worth considering before >retiring the bf animals. We try to avoid it, but the check is racy. Check the comments in SnapBuildSerialize. We could introduce locking etc to avoid that, but that seems overkill, given that we're really just dealing with a broken OS. Andres -- Sent from my Android device with K-9 Mail. Please excuse my brevity.
Re: subscriptionCheck failures on nightjar
Alvaro Herrera writes: > Uh .. I didn't think it was possible that we would build the same > snapshot file more than once. Isn't that a waste of time anyway? Maybe > we can fix the symptom by just not doing that in the first place? > I don't have a strategy to do that, but seems worth considering before > retiring the bf animals. The comment near the head of SnapBuildSerialize says * there is an obvious race condition here between the time we stat(2) the * file and us writing the file. But we rename the file into place * atomically and all files created need to contain the same data anyway, * so this is perfectly fine, although a bit of a resource waste. Locking * seems like pointless complication. which seems like a reasonable argument. Also, this is hardly the only place where we expect rename(2) to be atomic. So I tend to agree with Andres that we should consider OSes with such a bug to be unsupported. Dromedary is running the last release of macOS that supports 32-bit hardware, so if we decide to kick that to the curb, I'd either shut down the box or put some newer Linux or BSD variant on it. regards, tom lane
Re: WAL recycled despite logical replication slot
Hi, On September 20, 2019 5:45:34 AM PDT, Jeff Janes wrote: >While testing something else (whether "terminating walsender process >due to >replication timeout" was happening spuriously), I had logical >replication >set up streaming a default pgbench transaction load, with the publisher >being 13devel-e1c8743 and subscriber being 12BETA4. Eventually I >started >getting errors about requested wal segments being already removed: > >10863 sub idle 0 2019-09-19 17:14:58.140 EDT LOG: starting logical >decoding for slot "sub" >10863 sub idle 0 2019-09-19 17:14:58.140 EDT DETAIL: Streaming >transactions committing after 79/EB0B17A0, reading WAL from >79/E70736A0. >10863 sub idle 58P01 2019-09-19 17:14:58.140 EDT ERROR: requested WAL >segment 0001007900E7 has already been removed >10863 sub idle 0 2019-09-19 17:14:58.144 EDT LOG: disconnection: >session time: 0:00:00.030 user=jjanes database=jjanes host=10.0.2.2 >port=40830 > >It had been streaming for about 50 minutes before the error showed up, >and >it showed right when streaming was restarting after one of the >replication >timeouts. > >Is there an innocent explanation for this? I thought logical >replication >slots provided an iron-clad guarantee that WAL would be retained until >it >was no longer needed. I am just using pub/sub, none of the lower level >stuff. It indeed should. What's the content of pg_replication_slot for that slot? Andres -- Sent from my Android device with K-9 Mail. Please excuse my brevity.
Re: WIP: Generic functions for Node types using generated metadata
Hi, On 2019-09-19 22:18:57 -0700, Andres Freund wrote: > While working on this I evolved the node string format a bit: > > 1) Node types start with the their "normal" name, rather than >uppercase. There seems little point in having such a divergence. > > 2) The node type is followed by the node-type id. That allows to more >quickly locate the corresponding node metadata (array and one name >recheck, rather than a binary search). I.e. the node starts with >"{Scan 18 " rather than "{SCAN " as before. > > 3) Nodes that contain other nodes as sub-types "inline", still emit {} >for the subtype. There's no functional need for this, but I found the >output otherwise much harder to read. E.g. for mergejoin we'd have >something like > >{MergeJoin 37 :join {Join 35 :plan {Plan ...} :jointype JOIN_INNER ...} > :skip_mark_restore true ...} > > 4) As seen in the above example, enums are decoded to their string >values. I found that makes the output easier to read. Again, not >functionally required. > > 5) Value nodes aren't emitted without a {Value ...} anymore. I changed >this when I expanded the WRITE/READ tests, and encountered failures >because the old encoding is not entirely rountrip safe >(e.g. -INT32_MIN will be parsed as a float at raw parse time, but >after write/read, it'll be parsed as an integer). While that could be >fixed in other ways (e.g. by emitting a trailing . for all floats), I >also found it to be clearer this way - Value nodes are otherwise >undistinguishable from raw strings, raw numbers etc, which is not >great. > > It'd also be easier to now just change the node format to something else. E.g. to just use json. Which'd certainly be a lot easier to delve into, given the amount of tooling (both on the pg SQL level, and for commandline / editors / etc). I don't think it'd be any less efficient. There'd be a few more = signs, but the lexer is smarter / faster than the one currently in use for the outfuncs format. And we'd just reuse pg_parse_json rather than having a dedicated parser. - Andres
Re: backup manifests
On 9/20/19 2:55 PM, Robert Haas wrote: > On Fri, Sep 20, 2019 at 11:09 AM David Steele wrote: >> >> It sucks to make that a prereq for this project but the longer we kick >> that can down the road... > > There are no doubt many patches that would benefit from having more > backend infrastructure exposed in frontend contexts, and I think we're > slowly moving in that direction, but I generally do not believe in > burdening feature patches with major infrastructure improvements. The hardest part about technical debt is knowing when to incur it. It is never a cut-and-dried choice. >> This talk was good fun. The largest number of tables we've seen is a >> few hundred thousand, but that still adds up to more than a million >> files to backup. > > A quick survey of some of my colleagues turned up a few examples of > people with 2-4 million files to backup, so similar kind of ballpark. > Probably not big enough for the manifest to hit the 1GB mark, but > getting close. I have so many doubts about clusters with this many tables, but we do support it, so... >>> I hear you saying that this is going to end up being just as complex >>> in the end, but I don't think I believe it. It sounds to me like the >>> difference between spending a couple of hours figuring this out and >>> spending a couple of months trying to figure it out and maybe not >>> actually getting anywhere. >> >> Maybe the initial implementation will be easier but I am confident we'll >> pay for it down the road. Also, don't we want users to be able to read >> this file? Do we really want them to need to cook up a custom parser in >> Perl, Go, Python, etc.? > > Well, I haven't heard anybody complain that they can't read a > backup_label file because it's too hard to cook up a parser. And I > think the reason is pretty clear: such files are not hard to parse. > Similarly for a pg_hba.conf file. This case is a little more > complicated than those, but AFAICS, not enormously so. Actually, it > seems like a combination of those two cases: it has some fixed > metadata fields that can be represented with one line per field, like > a backup_label, and then a bunch of entries for files that are > somewhat like entries in a pg_hba.conf file, in that they can be > represented by a line per record with a certain number of fields on > each line. Yeah, they are not hard to parse, but *everyone* has to cook up code for it. A bit of a bummer, that. > I attach here a couple of patches. The first one does some > refactoring of relevant code in pg_basebackup, and the second one adds > checksum manifests using a format that I pulled out of my ear. It > probably needs some adjustment but I don't think it's crazy. Each > file gets a line that looks like this: > > File $FILENAME $FILESIZE $FILEMTIME $FILECHECKSUM We also include page checksum validation failures in the file record. Not critical for the first pass, perhaps, but something to keep in mind. > Right now, the file checksums are computed using SHA-256 but it could > be changed to anything else for which we've got code. On my system, > shasum -a256 $FILE produces the same answer that shows up here. At > the bottom of the manifest there's a checksum of the manifest itself, > which looks like this: > > Manifest-Checksum > 385fe156a8c6306db40937d59f46027cc079350ecf5221027d71367675c5f781 > > That's a SHA-256 checksum of the file contents excluding the final > line. It can be verified by feeding all the file contents except the > last line to shasum -a256. 
I can't help but observe that if the file > were defined to be a JSONB blob, it's not very clear how you would > include a checksum of the blob contents in the blob itself, but with a > format based on a bunch of lines of data, it's super-easy to generate > and super-easy to write tools that verify it. You can do this in JSON pretty easily by handling the terminating brace/bracket: { *, "checksum": } But of course a linefeed-delimited file is even easier. > This is just a prototype so I haven't written a verification tool, and > there's a bunch of testing and documentation and so forth that would > need to be done aside from whatever we've got to hammer out in terms > of design issues and file formats. But I think it's cool, and perhaps > some discussion of how it could be evolved will get us closer to a > resolution everybody can at least live with. I had a quick look and it seems pretty reasonable. I'll need to generate a manifest to see if I can spot any obvious gotchas. -- -David da...@pgmasters.net
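For illustration only, a rough sketch of the sort of verifier this format enables. It checks just the trailing Manifest-Checksum line by hashing everything before the final line, as described above. The function name is invented, the manifest is assumed to be already loaded into memory, and OpenSSL's SHA256() is used purely for brevity; none of this is part of the posted patches.

#include <stdbool.h>
#include <stdio.h>
#include <string.h>
#include <openssl/sha.h>

/* Hypothetical: verify the final "Manifest-Checksum <hex>" line of a
 * manifest held in buf/len. */
static bool
manifest_checksum_ok(const char *buf, size_t len)
{
	size_t		end = len;
	size_t		start;
	unsigned char hash[SHA256_DIGEST_LENGTH];
	char		hex[SHA256_DIGEST_LENGTH * 2 + 1];
	int			i;

	/* ignore a trailing newline on the file itself */
	if (end > 0 && buf[end - 1] == '\n')
		end--;

	/* find the start of the final line */
	start = end;
	while (start > 0 && buf[start - 1] != '\n')
		start--;

	/* final line must be "Manifest-Checksum " plus 64 hex digits */
	if (end - start != 18 + SHA256_DIGEST_LENGTH * 2 ||
		strncmp(buf + start, "Manifest-Checksum ", 18) != 0)
		return false;

	/* hash everything that precedes the final line */
	SHA256((const unsigned char *) buf, start, hash);

	for (i = 0; i < SHA256_DIGEST_LENGTH; i++)
		sprintf(hex + i * 2, "%02x", hash[i]);

	return strncmp(buf + start + 18, hex, SHA256_DIGEST_LENGTH * 2) == 0;
}

A real tool would additionally parse each File line and re-hash the file it names.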
Re: Wrong results using initcap() with non normalized string
Juan José Santamaría Flecha writes: > I have come around a strange situation when using a unicode string > that has non normalized characters. The attached script 'initcap.sql' > can reproduce the problem. > The attached patch can fix the issue. If we're going to start worrying about non-normalized characters, I suspect there are far more places than this one that we'd have to consider buggy :-(. As for the details of the patch, it seems overly certain that it's working with UTF8 data. regards, tom lane
Re: Nondeterministic collations vs. text_pattern_ops
Peter Eisentraut writes: >> Here is a draft patch. Where are we on pushing that? I'm starting to get antsy about the amount of time remaining before rc1. It's a low-risk fix, but still, it'd be best to have a complete buildfarm cycle on it before Monday's wrap. regards, tom lane
Re: PGCOLOR? (Re: pgsql: Unified logging system for command-line programs)
Peter Eisentraut writes: > It looks like there is documentation for PG_COLORS in the release notes > now, which seems like an odd place. Suggestions for a better place? BTW, as far as that goes, it looks like PG_COLOR is documented separately in each frontend program's "Environment" man page section. That's a bit duplicative but I don't think we have a better answer right now. Seems like you just need to add boilerplate text about PG_COLORS alongside each of those. regards, tom lane
Re: Optimization of some jsonb functions
I pushed the first few parts. The attached is a rebased copy of the last remaining piece. However, I didn't quite understand what this was doing, so I refrained from pushing. I think there are two patches here: one that adapts the API of findJsonbValueFromContainer and getIthJsonbValueFromContainer to take the output result pointer as an argument, allowing to save palloc cycles just like the newly added getKeyJsonValueFromContainer(); and the other changes JsonbDeepContains so that it uses a new function (which is a function with a weird API that would be extracted from findJsonbValueFromContainer). Also, the current patch just passes NULL into the routines from jsonpath_exec.c but I think it would be useful to pass pointers into stack-allocated result structs instead, at least in getJsonPathVariable. Since the majority of this patchset got pushed, I'll leave this for Nikita to handle for the next commitfest if he wants to, and mark this CF entry as committed. Thanks! -- Álvaro Herrerahttps://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services diff --git a/src/backend/utils/adt/jsonb_op.c b/src/backend/utils/adt/jsonb_op.c index a64206eeb1..82c4b0b2cb 100644 --- a/src/backend/utils/adt/jsonb_op.c +++ b/src/backend/utils/adt/jsonb_op.c @@ -24,6 +24,7 @@ jsonb_exists(PG_FUNCTION_ARGS) Jsonb *jb = PG_GETARG_JSONB_P(0); text *key = PG_GETARG_TEXT_PP(1); JsonbValue kval; + JsonbValue vval; JsonbValue *v = NULL; /* @@ -38,7 +39,7 @@ jsonb_exists(PG_FUNCTION_ARGS) v = findJsonbValueFromContainer(&jb->root, JB_FOBJECT | JB_FARRAY, - &kval); + &kval, &vval); PG_RETURN_BOOL(v != NULL); } @@ -59,6 +60,7 @@ jsonb_exists_any(PG_FUNCTION_ARGS) for (i = 0; i < elem_count; i++) { JsonbValue strVal; + JsonbValue valVal; if (key_nulls[i]) continue; @@ -69,7 +71,7 @@ jsonb_exists_any(PG_FUNCTION_ARGS) if (findJsonbValueFromContainer(&jb->root, JB_FOBJECT | JB_FARRAY, - &strVal) != NULL) + &strVal, &valVal) != NULL) PG_RETURN_BOOL(true); } @@ -92,6 +94,7 @@ jsonb_exists_all(PG_FUNCTION_ARGS) for (i = 0; i < elem_count; i++) { JsonbValue strVal; + JsonbValue valVal; if (key_nulls[i]) continue; @@ -102,7 +105,7 @@ jsonb_exists_all(PG_FUNCTION_ARGS) if (findJsonbValueFromContainer(&jb->root, JB_FOBJECT | JB_FARRAY, - &strVal) == NULL) + &strVal, &valVal) == NULL) PG_RETURN_BOOL(false); } diff --git a/src/backend/utils/adt/jsonb_util.c b/src/backend/utils/adt/jsonb_util.c index 7e0d9de7f0..7ec33cebf2 100644 --- a/src/backend/utils/adt/jsonb_util.c +++ b/src/backend/utils/adt/jsonb_util.c @@ -33,6 +33,8 @@ #define JSONB_MAX_ELEMS (Min(MaxAllocSize / sizeof(JsonbValue), JB_CMASK)) #define JSONB_MAX_PAIRS (Min(MaxAllocSize / sizeof(JsonbPair), JB_CMASK)) +static JsonbValue *findJsonbElementInArray(JsonbContainer *container, + JsonbValue *elem, JsonbValue *res); static void fillJsonbValue(JsonbContainer *container, int index, char *base_addr, uint32 offset, JsonbValue *result); @@ -327,38 +329,19 @@ compareJsonbContainers(JsonbContainer *a, JsonbContainer *b) */ JsonbValue * findJsonbValueFromContainer(JsonbContainer *container, uint32 flags, - JsonbValue *key) + JsonbValue *key, JsonbValue *res) { - JEntry *children = container->children; int count = JsonContainerSize(container); Assert((flags & ~(JB_FARRAY | JB_FOBJECT)) == 0); - /* Quick out without a palloc cycle if object/array is empty */ + /* Quick out if object/array is empty */ if (count <= 0) return NULL; if ((flags & JB_FARRAY) && JsonContainerIsArray(container)) { - JsonbValue *result = palloc(sizeof(JsonbValue)); - 
char *base_addr = (char *) (children + count); - uint32 offset = 0; - int i; - - for (i = 0; i < count; i++) - { - fillJsonbValue(container, i, base_addr, offset, result); - - if (key->type == result->type) - { -if (equalsJsonbScalarValue(key, result)) - return result; - } - - JBE_ADVANCE_OFFSET(offset, children[i]); - } - - pfree(result); + return findJsonbElementInArray(container, key, res); } else if ((flags & JB_FOBJECT) && JsonContainerIsObject(container)) { @@ -366,13 +349,48 @@ findJsonbValueFromContainer(JsonbContainer *container, uint32 flags, Assert(key->type == jbvString); return getKeyJsonValueFromContainer(container, key->val.string.val, - key->val.string.len, NULL); + key->val.string.len, res); } /* Not found */ return NULL; } +/* + * Subroutine for findJsonbValueFromContainer + * + * Find scalar element in Jsonb array and return it. + */ +static JsonbValue * +findJsonbElementInArray(JsonbContainer *container, JsonbValue *elem, + JsonbValue *res) +{ + JsonbValue *result; + JEntry *children = container->children; + int count = JsonContainerSize(container)
Re: Wrong results using initcap() with non normalized string
On 2019-Sep-20, Tom Lane wrote: > Juan José Santamaría Flecha writes: > > I have come around a strange situation when using a unicode string > > that has non normalized characters. The attached script 'initcap.sql' > > can reproduce the problem.

For illustration purposes:

SELECT initcap('ŞUB');
 initcap
─────────
 Şub
(1 fila)

SELECT initcap('ŞUB');
 initcap
─────────
 ŞUb
(1 fila)

> If we're going to start worrying about non-normalized characters, > I suspect there are far more places than this one that we'd have > to consider buggy :-(. I would think that we have to start somewhere, rather than take the position that we can never do anything about it. (ref: https://www.postgresql.org/message-id/flat/53E179E1.3060404%402ndquadrant.com ) -- Álvaro Herrera    https://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Re: subscriptionCheck failures on nightjar
On Fri, Sep 20, 2019 at 05:30:48PM +0200, Tomas Vondra wrote: > But even with that change you haven't managed to reproduce the issue, > right? Or am I misunderstanding? No, I was not able to see it on my laptop running Debian. -- Michael signature.asc Description: PGP signature
Re: Add "password_protocol" connection parameter to libpq
On Fri, Sep 20, 2019 at 10:57:04AM -0700, Jeff Davis wrote: > Thank you, applied. Okay, I can see which parts you have changed. > * I also changed the comment above pg_fe_scram_channel_bound() to > clarify that the caller must also check that the exchange was > successful. > > * I changed the error message when AUTH_REQ_OK is received without > channel binding. It seemed confusing before. I also added a test. And both make sense. + * Return true if channel binding was employed and the scram exchange upper('scram')? Except for this nit, it looks good to me. -- Michael signature.asc Description: PGP signature
Re: Efficient output for integer types
> "David" == David Fetter writes: David> + /* Compute the result string. */ David> + if (value >= 1) David> + { David> + const uint32 value2 = value % 1; David> + David> + const uint32 c = value2 % 1; David> + const uint32 d = value2 / 1; David> + const uint32 c0 = (c % 100) << 1; David> + const uint32 c1 = (c / 100) << 1; David> + const uint32 d0 = (d % 100) << 1; David> + const uint32 d1 = (d / 100) << 1; David> + David> + char *pos = a + olength - i; David> + David> + value /= 1; David> + David> + memcpy(pos - 2, DIGIT_TABLE + c0, 2); David> + memcpy(pos - 4, DIGIT_TABLE + c1, 2); David> + memcpy(pos - 6, DIGIT_TABLE + d0, 2); David> + memcpy(pos - 8, DIGIT_TABLE + d1, 2); David> + i += 8; David> + } For the 32-bit case, there's no point in doing an 8-digit divide specially, it doesn't save any time. It's sufficient to just change David> + if (value >= 1) to while(value >= 1) in order to process 4 digits at a time. David> + for(int i = 0; i < minwidth - len; i++) David> + { David> + memcpy(str + i, DIGIT_TABLE, 1); David> + } Should be: memset(str, '0', minwidth-len); -- Andrew (irc:RhodiumToad)
Re: pgbench - allow to create partitioned tables
On Sat, Sep 21, 2019 at 12:26 AM Fabien COELHO wrote: > > >> I would not bother to create a patch for so small an improvement. This > >> makes sense in passing because the created function makes it very easy, > >> but otherwise I'll just drop it. > > > > I would prefer to drop for now. > > Attached v13 does that, I added a comment instead. I do not think that it > is an improvement. > > > + else > > + { > > + fprintf(stderr, "unexpected partition method: \"%s\"\n", ps); > > + exit(1); > > + } > > > > If we can't catch that earlier, then it might be better to have some > > version-specific checks rather than such obscure code which is > > difficult to understand for others. > > Hmmm. The code simply checks for the current partitioning and fails if the > result is unknown, which I understood was what you asked, the previous > version was just ignoring the result. > Yes, this code is correct. I am not sure if you understood the point, so let me try again. I am bothered about below code in the patch: + /* only print partitioning information if some partitioning was detected */ + if (partition_method != PART_NONE) This is the only place now where we check 'whether there are any partitions' differently. I am suggesting to make this similar to other checks (if (partitions > 0)). > The likelyhood of postgres dropping support for range or hash partitions > seems unlikely. > > This issue rather be raised if an older partition-enabled pgbench is run > against a newer postgres which adds a new partition method. But then I > cannot guess when a new partition method will be added, so I cannot put a > guard with a version about something in the future. Possibly, if no new > method is ever added, the code will never be triggered. > Sure, even in that case your older version of pgbench will be able to detect by below code: + else + { + fprintf(stderr, "unexpected partition method: \"%s\"\n", ps); + exit(1); + } > > > * improve the comments around query to fetch partitions > > What? How? > > There are already quite a few comments compared to the length of the > query. > Hmm, you have just written what each part of the query is doing which I think one can identify if we write some general comment as I have in the patch to explain the overall intent. Even if we write what each part of the statement is doing, the comment explaining overall intent is required. I personally don't like writing a comment for each sub-part of the query as that makes reading the query difficult. See the patch sent by me in my previous email. > > * improve the comments in the patch and make the code look like nearby > > code > > This requirement is to fuzzy. I re-read the changes, and both code and > comments look okay to me. > I have done that in some of the cases in the patch attached by me in my last email. Have you looked at those changes? Try to make those changes in the next version unless you see something wrong is written in comments. > > * pgindent the patch > > Done. > > > I think we should try to add some note or comment that why we only > > choose to partition pgbench_accounts table when the user has given > > --partitions option. > > Added as a comment on the initPartition function. > I am not sure if something like that is required in the docs, but we can leave it for now. -- With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com
Re: Don't codegen deform code for virtual tuples in expr eval for scan fetch
Hello, In my previous patch 0001, the resulting opblock consisted of a single br instruction to it's successor opblock. Such a block represents unnecessary overhead. Even though such a block would be optimized away, what if optimization is not performed (perhaps due to jit_optimize_above_cost)? Perhaps we could be more aggressive. We could maybe remove the opblock altogether. However, such a solution is not without complexity. Such unlinking is non-trivial, and is one of the things the simplify-cfg pass takes care of. One solution could be to run this pass ad-hoc for the evalexpr function. Or we could perform the unlinking during codegen. Such a solution is presented below. To unlink the current opblock from the cfg we have to replace all of its uses. From the current state of the code, these uses are either: 1. Terminator instruction of opblocks[i - 1] 2. Terminator instruction of some sub-op-block of opblocks[i - 1]. By sub-op-block, I mean some block that is directly linked from opblock[i - 1] but not opblocks[i]. 3. Terminator instruction of the entry block. We should replace all of these uses with opblocks[i + 1] and then and only then can we delete opblocks[i]. My latest patch v2-0001 in my latest patch set, achieves this. I guard LLVMReplaceAllUsesWith() with Assert()s to ensure that we don't accidentally replace non-terminator uses of opblocks[i], should they be introduced in the future. If these asserts fail in the future, replacing these non-terminator instructions with undefs constitutes a commonly adopted solution. Otherwise, we can always fall back to making opblocks[i] empty with just the unconditional br, as in my previous patch 0001. -- Soumyadeep On Tue, Sep 17, 2019 at 11:54 PM Soumyadeep Chakraborty < soumyadeep2...@gmail.com> wrote: > Hello Hackers, > > This is to address a TODO I found in the JIT expression evaluation > code (opcode = > EEOP_INNER_FETCHSOME/EEOP_OUTER_FETCHSOME/EEOP_SCAN_FETCHSOME): > > * TODO: skip nvalid check if slot is fixed and known to > * be a virtual slot. > > Not only should we skip the nvalid check if the tuple is virtual, the > whole basic block should be a no-op. There is no deforming to be done, > JIT (slot_compile_deform) or otherwise (slot_getsomeattrs) for virtual > tuples. > > Attached is a patch 0001 that achieves exactly that by: > > 1. Performing the check at the very beginning of the case. > 2. Emitting an unconditional branch to the next opblock if we have a > virtual tuple. We end up with a block with just the single > unconditional branch instruction in this case. This block is optimized > away when we run llvm_optimize_module(). > > Also enclosed is another patch 0002 that adds on some minor > refactoring of conditionals around whether to JIT deform or not. > > --- > Soumyadeep Chakraborty > v2-0001-Don-t-codegen-if-tuple-is-virtual-in-expr-eval-of.patch Description: Binary data v2-0002-Minor-refactor-JIT-deform-or-not.patch Description: Binary data
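For illustration, a minimal sketch of the unlinking step described above using the LLVM C API (this is not the patch itself; as noted, the actual patch additionally guards the replacement with Assert()s that every remaining use of the block is a terminator instruction):

	/* retarget every predecessor branch at the successor block ... */
	LLVMValueRef dead = LLVMBasicBlockAsValue(opblocks[i]);

	LLVMReplaceAllUsesWith(dead, LLVMBasicBlockAsValue(opblocks[i + 1]));

	/* ... and drop the now-unreachable block entirely */
	LLVMDeleteBasicBlock(opblocks[i]);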
Re: Efficient output for integer types
On Sat, Sep 21, 2019 at 03:36:21AM +0100, Andrew Gierth wrote: > > "David" == David Fetter writes: > > David> + /* Compute the result string. */ > David> + if (value >= 1) > David> + { > David> + const uint32 value2 = value % 1; > David> + > David> + const uint32 c = value2 % 1; > David> + const uint32 d = value2 / 1; > David> + const uint32 c0 = (c % 100) << 1; > David> + const uint32 c1 = (c / 100) << 1; > David> + const uint32 d0 = (d % 100) << 1; > David> + const uint32 d1 = (d / 100) << 1; > David> + > David> + char *pos = a + olength - i; > David> + > David> + value /= 1; > David> + > David> + memcpy(pos - 2, DIGIT_TABLE + c0, 2); > David> + memcpy(pos - 4, DIGIT_TABLE + c1, 2); > David> + memcpy(pos - 6, DIGIT_TABLE + d0, 2); > David> + memcpy(pos - 8, DIGIT_TABLE + d1, 2); > David> + i += 8; > David> + } > > For the 32-bit case, there's no point in doing an 8-digit divide > specially, it doesn't save any time. It's sufficient to just change > > David> + if (value >= 1) > > to while(value >= 1) Done. > in order to process 4 digits at a time. > > David> + for(int i = 0; i < minwidth - len; i++) > David> + { > David> + memcpy(str + i, DIGIT_TABLE, 1); > David> + } > > Should be: > memset(str, '0', minwidth-len); Done. Best, David. -- David Fetter http://fetter.org/ Phone: +1 415 235 3778 Remember to vote! Consider donating to Postgres: http://www.postgresql.org/about/donate >From 4407b0771cd53890acf41ffee8908d1a701c52c0 Mon Sep 17 00:00:00 2001 From: David Fetter Date: Sun, 15 Sep 2019 00:06:29 -0700 Subject: [PATCH v10] Make int4 and int8 operations more efficent To: hackers MIME-Version: 1.0 Content-Type: multipart/mixed; boundary="2.21.0" This is a multi-part message in MIME format. --2.21.0 Content-Type: text/plain; charset=UTF-8; format=fixed Content-Transfer-Encoding: 8bit - Output routines now do more digits per iteration, and - Code determines the number of decimal digits in int4/int8 efficiently - Split off pg_ltoa_n from pg_ltoa - Use same to make other functions shorter diff --git a/src/backend/access/common/printsimple.c b/src/backend/access/common/printsimple.c index 651ade14dd..5c5b6d33b2 100644 --- a/src/backend/access/common/printsimple.c +++ b/src/backend/access/common/printsimple.c @@ -112,7 +112,7 @@ printsimple(TupleTableSlot *slot, DestReceiver *self) case INT8OID: { int64 num = DatumGetInt64(value); - char str[23]; /* sign, 21 digits and '\0' */ + char str[MAXINT8LEN + 1]; pg_lltoa(num, str); pq_sendcountedtext(&buf, str, strlen(str), false); diff --git a/src/backend/utils/adt/int8.c b/src/backend/utils/adt/int8.c index 0ff9394a2f..6230807906 100644 --- a/src/backend/utils/adt/int8.c +++ b/src/backend/utils/adt/int8.c @@ -27,8 +27,6 @@ #include "utils/builtins.h" -#define MAXINT8LEN 25 - typedef struct { int64 current; diff --git a/src/backend/utils/adt/numutils.c b/src/backend/utils/adt/numutils.c index 70138feb29..babc6f6166 100644 --- a/src/backend/utils/adt/numutils.c +++ b/src/backend/utils/adt/numutils.c @@ -20,6 +20,68 @@ #include "common/int.h" #include "utils/builtins.h" +#include "port/pg_bitutils.h" + +/* + * A table of all two-digit numbers. This is used to speed up decimal digit + * generation by copying pairs of digits into the final output. 
+ */ +static const char DIGIT_TABLE[200] = +"00" "01" "02" "03" "04" "05" "06" "07" "08" "09" +"10" "11" "12" "13" "14" "15" "16" "17" "18" "19" +"20" "21" "22" "23" "24" "25" "26" "27" "28" "29" +"30" "31" "32" "33" "34" "35" "36" "37" "38" "39" +"40" "41" "42" "43" "44" "45" "46" "47" "48" "49" +"50" "51" "52" "53" "54" "55" "56" "57" "58" "59" +"60" "61" "62" "63" "64" "65" "66" "67" "68" "69" +"70" "71" "72" "73" "74" "75" "76" "77" "78" "79" +"80" "81" "82" "83" "84" "85" "86" "87" "88" "89" +"90" "91" "92" "93" "94" "95" "96" "97" "98" "99"; + +/* + * Adapted from http://graphics.stanford.edu/~seander/bithacks.html#IntegerLog10 + */ +static inline uint32 +decimalLength32(const uint32 v) +{ + uint32 t; + static uint32 PowersOfTen[] = + {1,10,100, + 1000, 1, 10, + 100, 1000, 1, + 10}; + /* + * Compute base-10 logarithm by dividing the base-2 logarithm + * by a good-enough approximation of the base-2 logarithm of 10 + */ + t = (pg_leftmost_one_pos32(v) + 1)*1233/4096; + return t + (v >= PowersOfTen[t]); +} + +static inline uint32 +decimalLength64(const uint64_t v) +{ + uint32 t; + static uint64_t PowersOfTen[] = { + UINT64CONST(1), UINT64CONST(10), +
Re: A problem about partitionwise join
On Fri, Sep 20, 2019 at 2:33 PM Richard Guo wrote: > > Hi Dilip, > > Thank you for reviewing this patch. > > On Fri, Sep 20, 2019 at 12:48 PM Dilip Kumar wrote: >> >> On Thu, Aug 29, 2019 at 3:15 PM Richard Guo wrote: >> > >> > >> > Attached is a patch as an attempt to address this issue. The idea is >> > quite straightforward. When building partition info for joinrel, we >> > generate any possible EC-derived joinclauses of form 'outer_em = >> > inner_em', which will be used together with the original restrictlist to >> > check if there exists an equi-join condition for each pair of partition >> > keys. >> > >> > Any comments are welcome! >> /* >> + * generate_join_implied_equalities_for_all >> + * Create any EC-derived joinclauses of form 'outer_em = inner_em'. >> + * >> + * This is used when building partition info for joinrel. >> + */ >> +List * >> +generate_join_implied_equalities_for_all(PlannerInfo *root, >> + Relids join_relids, >> + Relids outer_relids, >> + Relids inner_relids) >> >> I think we need to have more detailed comments about why we need this >> separate function, we can also explain that >> generate_join_implied_equalities function will avoid >> the join clause if EC has the constant but for partition-wise join, we >> need that clause too. > > > Thank you for the suggestion. > >> >> >> + while ((i = bms_next_member(matching_ecs, i)) >= 0) >> + { >> + EquivalenceClass *ec = (EquivalenceClass *) list_nth(root->eq_classes, i); >> + List*outer_members = NIL; >> + List*inner_members = NIL; >> + ListCell *lc1; >> + >> + /* Do not consider this EC if it's ec_broken */ >> + if (ec->ec_broken) >> + continue; >> + >> + /* Single-member ECs won't generate any deductions */ >> + if (list_length(ec->ec_members) <= 1) >> + continue; >> + >> >> I am wondering isn't it possible to just process the missing join >> clause? I mean 'generate_join_implied_equalities' has only skipped >> the ECs which has const so >> can't we create join clause only for those ECs and append it the >> "Restrictlist" we already have? I might be missing something? > > > For ECs without const, 'generate_join_implied_equalities' also has > skipped some join clauses since it only selects the joinclause with > 'best_score' between outer members and inner members. And the missing > join clauses are needed to generate partitionwise join. That's why the > query below cannot be planned as partitionwise join, as we have missed > joinclause 'foo.k = bar.k'. oh right. I missed that part. > select * from p as foo join p as bar on foo.k = bar.val and foo.k = bar.k; > > And yes 'generate_join_implied_equalities_for_all' will create join > clauses that have existed in restrictlist. I think it's OK since the > same RestrictInfo deduced from EC will share the same pointer and > list_concat_unique_ptr will make sure there are no duplicates in the > restrictlist used by have_partkey_equi_join. > ok -- Regards, Dilip Kumar EnterpriseDB: http://www.enterprisedb.com
Re: Efficient output for integer types
> "David" == David Fetter writes: David> +static inline uint32 David> +decimalLength64(const uint64_t v) Should be uint64, not uint64_t. Also return an int, not a uint32. For int vs. int32, my own inclination is to use "int" where the value is just a (smallish) number, especially one that will be used as an index or loop count, and use "int32" when it actually matters that it's 32 bits rather than some other size. Other opinions may differ. David> +{ David> + uint32 t; David> + static uint64_t PowersOfTen[] = { uint64 not uint64_t here too. David> +int32 David> +pg_ltoa_n(uint32 value, char *a) If this is going to handle only unsigned values, it should probably be named pg_ultoa_n. David> + uint32 i = 0, adjust = 0; "adjust" is not assigned anywhere else. Presumably that's from previous handling of negative numbers? David> + memcpy(a, "0", 1); *a = '0'; would suffice. David> + i += adjust; Superfluous? David> + uint32_tuvalue = (uint32_t)value; uint32 not uint32_t. David> + int32 len; See above re. int vs. int32. David> + uvalue = (uint32_t)0 - (uint32_t)value; Should be uint32 not uint32_t again. For anyone wondering, I suggested this to David in place of the ugly special casing of INT32_MIN. This method avoids the UB of doing (-value) where value==INT32_MIN, and is nevertheless required to produce the correct result: 1. If value < 0, then ((uint32)value) is (value + UINT32_MAX + 1) 2. (uint32)0 - (uint32)value becomes (UINT32_MAX+1)-(value+UINT32_MAX+1) which is (-value) as required David> +int32 David> +pg_lltoa_n(uint64_t value, char *a) Again, if this is doing unsigned, then it should be named pg_ulltoa_n David> + if (value == PG_INT32_MIN) This being inconsistent with the others is not nice. -- Andrew (irc:RhodiumToad)