Re: Custom reloptions and lock modes

2019-09-20 Thread Michael Paquier
On Fri, Sep 20, 2019 at 11:59:13AM +0530, Kuntal Ghosh wrote:
> On Fri, Sep 20, 2019 at 7:08 AM Michael Paquier  wrote:
>> Hence attached is a patch set to address those issues:
>> - 0001 makes sure that any existing module creating a custom reloption
>> has the lock mode set to AccessExclusiveLock, which would be a sane
>> default anyway.  I think that we should just back-patch that and close
>> any holes.
>
> Looks good to me. The patch solves the issue and passes with
> regression tests. IMHO, it should be back-patched to all the branches.

That's the plan but...

>> - 0002 is a patch which we could use to extend the existing reloption
>> APIs to set the lock mode used.  I am aware of the recent work done by
>> Nikolay in CC to rework this API set, but I am unsure where we are
>> going there, and the resulting patch is actually very short, being
>> 20-line long with the current infrastructure.  That could go into
>> HEAD.  Table AMs have been added in v12 so custom reloptions could
>> gain more in popularity, but as we are very close to the release it
>> would not be cool to break those APIs.  The patch simplicity could
>> also be a reason sufficient for a back-patch, and I don't think that
>> there are many users of them yet.
>>
>
> I think this is good approach for now and can be committed on the
> HEAD only.

Let's wait a couple of days to see if others have any objections to
offer on the matter.  My plan would be to revisit this patch set after
RC1 is tagged next week to at least fix the bug.  I don't predict any
strong objections to the patch for HEAD, but who knows..

> One small thing:
> 
>   add_int_reloption(bl_relopt_kind, buf,
>"Number of bits generated for each index column",
> -  DEFAULT_BLOOM_BITS, 1, MAX_BLOOM_BITS);
> +  DEFAULT_BLOOM_BITS, 1, MAX_BLOOM_BITS,
> +  AccessExclusiveLock);
> Do we need a comment to explain why we're using AccessExclusiveLock in
> this case?

Because that's the safest default to use here?  That seemed obvious to
me.
--
Michael




Re: Custom reloptions and lock modes

2019-09-20 Thread Kuntal Ghosh
On Fri, Sep 20, 2019 at 12:38 PM Michael Paquier  wrote:
>
> > One small thing:
> >
> >   add_int_reloption(bl_relopt_kind, buf,
> >"Number of bits generated for each index column",
> > -  DEFAULT_BLOOM_BITS, 1, MAX_BLOOM_BITS);
> > +  DEFAULT_BLOOM_BITS, 1, MAX_BLOOM_BITS,
> > +  AccessExclusiveLock);
> > Do we need a comment to explain why we're using AccessExclusiveLock in
> > this case?
>
> Because that's the safest default to use here?  That seemed obvious to
> me.
Okay. Sounds good.



-- 
Thanks & Regards,
Kuntal Ghosh
EnterpriseDB: http://www.enterprisedb.com




Re: backup manifests

2019-09-20 Thread Michael Paquier
On Thu, Sep 19, 2019 at 11:10:46PM -0400, David Steele wrote:
> On 9/19/19 11:00 AM, Robert Haas wrote:
>> On Thu, Sep 19, 2019 at 9:51 AM Robert Haas  wrote:
>> > I intend that we should be able to support incremental backups based
>> > either on a previous full backup or based on a previous incremental
>> > backup. I am not aware of a technical reason why we need to identify
>> > the specific backup that must be used. If incremental backup B is
>> > taken based on a pre-existing backup A, then I think that B can be
>> > restored using either A or *any other backup taken after A and before
>> > B*. In the normal case, there probably wouldn't be any such backup,
>> > but AFAICS the start-LSNs are a sufficient cross-check that the chosen
>> > base backup is legal.
>> 
>> Scratch that: there can be overlapping backups, so you have to
>> cross-check both start and stop LSNs.
> 
> Overall we have found it's much simpler to label each backup and cross-check
> that against the pg version and system id.  Start LSN is pretty unique, but
> backup labels work really well and are more widely understood.

Warning.  The start LSN could be the same for multiple backups when
taken from a standby.
--
Michael




Re: Two pg_rewind patches (auto generate recovery conf and ensure clean shutdown)

2019-09-20 Thread Paul Guo
The patch series failed on Windows CI. I modified the Windows build file to
fix that. See attached for the v7 version.


v7-0003-Ensure-target-clean-shutdown-at-the-beginning-of-.patch
Description: Binary data


v7-0001-Extact-common-recovery-related-functions-from-pg_.patch
Description: Binary data


v7-0002-Add-option-to-write-recovery-configuration-inform.patch
Description: Binary data


Re: psql - add SHOW_ALL_RESULTS option

2019-09-20 Thread Fabien COELHO



> Should we add function header for the below function to maintain the
> common standard of this file:

Yes. Attached v6 does that.

--
Fabien.

diff --git a/contrib/pg_stat_statements/expected/pg_stat_statements.out b/contrib/pg_stat_statements/expected/pg_stat_statements.out
index 6787ec1efd..de59a5cf65 100644
--- a/contrib/pg_stat_statements/expected/pg_stat_statements.out
+++ b/contrib/pg_stat_statements/expected/pg_stat_statements.out
@@ -49,17 +49,42 @@ BEGIN \;
 SELECT 2.0 AS "float" \;
 SELECT 'world' AS "text" \;
 COMMIT;
+ float 
+---
+   2.0
+(1 row)
+
+ text  
+---
+ world
+(1 row)
+
 -- compound with empty statements and spurious leading spacing
-\;\;   SELECT 3 + 3 \;\;\;   SELECT ' ' || ' !' \;\;   SELECT 1 + 4 \;;
- ?column? 
---
-5
+\;\;   SELECT 3 + 3 AS "+" \;\;\;   SELECT ' ' || ' !' AS "||" \;\;   SELECT 1 + 4 AS "+" \;;
+ + 
+---
+ 6
+(1 row)
+
+ ||  
+-
+   !
+(1 row)
+
+ + 
+---
+ 5
 (1 row)
 
 -- non ;-terminated statements
 SELECT 1 + 1 + 1 AS "add" \gset
 SELECT :add + 1 + 1 AS "add" \;
 SELECT :add + 1 + 1 AS "add" \gset
+ add 
+-
+   5
+(1 row)
+
 -- set operator
 SELECT 1 AS i UNION SELECT 2 ORDER BY i;
  i 
@@ -102,12 +127,12 @@ SELECT query, calls, rows FROM pg_stat_statements ORDER BY query COLLATE "C";
  SELECT $1+| 4 |4
   +|   | 
AS "text"   |   | 
- SELECT $1 + $2| 2 |2
  SELECT $1 + $2 + $3 AS "add"  | 3 |3
+ SELECT $1 + $2 AS "+" | 2 |2
  SELECT $1 AS "float"  | 1 |1
  SELECT $1 AS "int"| 2 |2
  SELECT $1 AS i UNION SELECT $2 ORDER BY i | 1 |2
- SELECT $1 || $2   | 1 |1
+ SELECT $1 || $2 AS "||"   | 1 |1
  SELECT pg_stat_statements_reset() | 1 |1
  WITH t(f) AS (   +| 1 |2
VALUES ($1), ($2)  +|   | 
diff --git a/contrib/pg_stat_statements/sql/pg_stat_statements.sql b/contrib/pg_stat_statements/sql/pg_stat_statements.sql
index 8b527070d4..ea3a31176e 100644
--- a/contrib/pg_stat_statements/sql/pg_stat_statements.sql
+++ b/contrib/pg_stat_statements/sql/pg_stat_statements.sql
@@ -27,7 +27,7 @@ SELECT 'world' AS "text" \;
 COMMIT;
 
 -- compound with empty statements and spurious leading spacing
-\;\;   SELECT 3 + 3 \;\;\;   SELECT ' ' || ' !' \;\;   SELECT 1 + 4 \;;
+\;\;   SELECT 3 + 3 AS "+" \;\;\;   SELECT ' ' || ' !' AS "||" \;\;   SELECT 1 + 4 AS "+" \;;
 
 -- non ;-terminated statements
 SELECT 1 + 1 + 1 AS "add" \gset
diff --git a/doc/src/sgml/ref/psql-ref.sgml b/doc/src/sgml/ref/psql-ref.sgml
index 7789fc6177..4e6ab5a0a5 100644
--- a/doc/src/sgml/ref/psql-ref.sgml
+++ b/doc/src/sgml/ref/psql-ref.sgml
@@ -127,18 +127,11 @@ echo '\x \\ SELECT * FROM foo;' | psql
commands included in the string to divide it into multiple
transactions.  (See 
for more details about how the server handles multi-query strings.)
-   Also, psql only prints the
-   result of the last SQL command in the string.
-   This is different from the behavior when the same string is read from
-   a file or fed to psql's standard input,
-   because then psql sends
-   each SQL command separately.
   
   
-   Because of this behavior, putting more than one SQL command in a
-   single -c string often has unexpected results.
-   It's better to use repeated -c commands or feed
-   multiple commands to psql's standard input,
+   If having several commands executed in one transaction is not desired, 
+   use repeated -c commands or feed multiple commands to
+   psql's standard input,
either using echo as illustrated above, or
via a shell here-document, for example:
 
@@ -3355,10 +3348,8 @@ select 1\; select 2\; select 3;
 commands included in the string to divide it into multiple
 transactions.  (See 
 for more details about how the server handles multi-query strings.)
-psql prints only the last query result
-it receives for each request; in this example, although all
-three SELECTs are indeed executed, psql
-only prints the 3.
+psql prints all results it receives, one
+after the other.
 
 
   
diff --git a/src/bin/psql/common.c b/src/bin/psql/common.c
index 4b2679360f..db6d031133 100644
--- a/src/bin/psql/common.c
+++ b/src/bin/psql/common.c
@@ -486,6 +486,19 @@ ResetCancelConn(void)
 #endif
 }
 
+/*
+ * Show error message from result, if any, and check connection in passing.
+ */
+static void
+ShowErrorMessage(const PGresult *result)
+{
+	const char *error = PQerrorMessage(pset.d

Re: Allow CLUSTER, VACUUM FULL and REINDEX to change tablespace on the fly

2019-09-20 Thread Jose Luis Tallon

On 20/9/19 4:06, Michael Paquier wrote:
> On Thu, Sep 19, 2019 at 05:40:41PM +0300, Alexey Kondratov wrote:
>> On 19.09.2019 16:21, Robert Haas wrote:
>>> So, earlier in this thread, I suggested making this part of ALTER
>>> TABLE, and several people seemed to like that idea. Did we have a
>>> reason for dropping that approach?
>
> Personally, I don't find this idea very attractive as ALTER TABLE is
> already complicated enough with all the subqueries we already support
> in the command, all the logic we need to maintain to make combinations
> of those subqueries in a minimum number of steps, and also the number
> of bugs we have seen because of the amount of complication present.

Yes, but please keep the other options: as it is, CLUSTER, VACUUM FULL
and REINDEX already rewrite the table in full; being able to write the
result to a different tablespace than the one the original object was
stored in enables a whole world of very interesting possibilities,
including a quick way out of a "so little disk space available that
vacuum won't work properly" situation --- which I'm sure MANY users will
appreciate, including me.



>> If we add this option to REINDEX, then for 'ALTER TABLE tb_name action1,
>> REINDEX SET TABLESPACE tbsp_name, action3' action2 will be just a direct
>> alias to 'REINDEX TABLE tb_name SET TABLESPACE tbsp_name'. So it seems
>> practical to do this for REINDEX first.
>>
>> The only concern I have against adding REINDEX to ALTER TABLE in this
>> context is that it will allow users to write such a chimera:
>>
>> ALTER TABLE tb_name REINDEX SET TABLESPACE tbsp_name, SET TABLESPACE
>> tbsp_name;
>>
>> when they want to move both the table and all the indexes, because the simple
>> ALTER TABLE tb_name REINDEX, SET TABLESPACE tbsp_name;
>> looks ambiguous. Should it change the tablespace of the table, the indexes, or both?


Indeed.

IMHO, that form of the command should not allow that much flexibility... 
even on the "principle of least surprise" grounds :S


That is, I'd restrict the ability to change (output) tablespace to the 
"direct" form --- REINDEX name, VACUUM (FULL) name, CLUSTER name --- 
whereas the ALTER table|index SET TABLESPACE would continue to work.


Now that I come to think of it, maybe saying "output" or "move to" 
rather than "set tablespace" would make more sense for this variation of 
the commands? (clearer, less prone to confusion)?



> Tricky question, but we don't change the tablespace of indexes when
> using an ALTER TABLE, so I would say no on compatibility grounds.
> ALTER TABLE has never touched the tablespace of indexes, and I don't
> think that we should begin to do so.


Indeed.


I might be missing something, but is there any reason not to *require* an
explicit transaction for the above multi-action commands? I mean, have
it be:

BEGIN;

ALTER TABLE tb_name SET TABLESPACE tbsp_name;    -- moves the table but possibly NOT the indexes?

ALTER TABLE tb_name REINDEX [OUTPUT TABLESPACE tbsp_name];    -- REINDEX, placing the resulting index on tbsp_name instead of the original one

COMMIT;

... and have the parser/planner combine the steps if it'd make sense (it
probably wouldn't in this example)?



Just my .02€


Thanks,

    / J.L.






Re: A problem about partitionwise join

2019-09-20 Thread Richard Guo
Hi Dilip,

Thank you for reviewing this patch.

On Fri, Sep 20, 2019 at 12:48 PM Dilip Kumar  wrote:

> On Thu, Aug 29, 2019 at 3:15 PM Richard Guo  wrote:
> >
> >
> > Attached is a patch as an attempt to address this issue. The idea is
> > quite straightforward. When building partition info for joinrel, we
> > generate any possible EC-derived joinclauses of form 'outer_em =
> > inner_em', which will be used together with the original restrictlist to
> > check if there exists an equi-join condition for each pair of partition
> > keys.
> >
> > Any comments are welcome!
>  /*
> + * generate_join_implied_equalities_for_all
> + *   Create any EC-derived joinclauses of form 'outer_em = inner_em'.
> + *
> + * This is used when building partition info for joinrel.
> + */
> +List *
> +generate_join_implied_equalities_for_all(PlannerInfo *root,
> + Relids join_relids,
> + Relids outer_relids,
> + Relids inner_relids)
>
> I think we need to have more detailed comments about why we need this
> separate function, we can also explain that
> generate_join_implied_equalities function will avoid
> the join clause if EC has the constant but for partition-wise join, we
> need that clause too.
>

Thank you for the suggestion.


>
> + while ((i = bms_next_member(matching_ecs, i)) >= 0)
> + {
> + EquivalenceClass *ec = (EquivalenceClass *) list_nth(root->eq_classes,
> i);
> + List*outer_members = NIL;
> + List*inner_members = NIL;
> + ListCell   *lc1;
> +
> + /* Do not consider this EC if it's ec_broken */
> + if (ec->ec_broken)
> + continue;
> +
> + /* Single-member ECs won't generate any deductions */
> + if (list_length(ec->ec_members) <= 1)
> + continue;
> +
>
> I am wondering isn't it possible to just process the missing join
> clause?  I mean 'generate_join_implied_equalities' has only skipped
> the ECs which has const so
> can't we create join clause only for those ECs and append it the
> "Restrictlist" we already have?  I might be missing something?
>

For ECs without const, 'generate_join_implied_equalities' also skips
some join clauses, since it only selects the joinclause with the
'best_score' between outer members and inner members. And those missing
join clauses are needed to generate a partitionwise join. That's why the
query below cannot be planned as a partitionwise join: we have missed
the joinclause 'foo.k = bar.k'.

select * from p as foo join p as bar on foo.k = bar.val and foo.k = bar.k;

And yes, 'generate_join_implied_equalities_for_all' will create join
clauses that already exist in the restrictlist. I think that's OK, since
the same RestrictInfo deduced from an EC will share the same pointer,
and list_concat_unique_ptr will make sure there are no duplicates in the
restrictlist used by have_partkey_equi_join.
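
To sketch the idea (this is not the exact patch code, just the shape of
it): the EC-derived clauses are merged into the existing restrictlist,
relying on pointer equality of RestrictInfos deduced from the same
EquivalenceClass to drop the duplicates.

    List   *extra = generate_join_implied_equalities_for_all(root,
                                                              joinrel->relids,
                                                              outer_rel->relids,
                                                              inner_rel->relids);

    restrictlist = list_concat_unique_ptr(restrictlist, extra);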

Thanks
Richard


Re: psql - add SHOW_ALL_RESULTS option

2019-09-20 Thread vignesh C
On Fri, Sep 20, 2019 at 1:41 PM Fabien COELHO  wrote:
>
>
> > Should we add function header for the below function to maintain the
> > common standard of this file:
>
> Yes. Attached v6 does that.
>
Thanks for fixing it.

The below addition can be removed, it seems to be duplicate:
@@ -3734,6 +3734,11 @@ listTables(const char *tabtypes, const char
*pattern, bool verbose, bool showSys
  translate_columns[cols_so_far] = true;
  }

+ /*
+ * We don't bother to count cols_so_far below here, as there's no need
+ * to; this might change with future additions to the output columns.
+ */
+
  /*
  * We don't bother to count cols_so_far below here, as there's no need
  * to; this might change with future additions to the output columns.

Regards,
Vignesh
EnterpriseDB: http://www.enterprisedb.com




Re: d25ea01275 and partitionwise join

2019-09-20 Thread Richard Guo
On Thu, Sep 19, 2019 at 4:15 PM Amit Langote 
wrote:

> Hi Richard,
>
> Thanks a lot for taking a close look at the patch and sorry about the
> delay.
>
> On Wed, Sep 4, 2019 at 5:29 PM Richard Guo  wrote:
> >> On Wed, Sep 4, 2019 at 10:01 AM Amit Langote 
> wrote:
> > I'm reviewing v2-0002 and I have concern about how COALESCE expr is
> > processed in match_join_arg_to_partition_keys().
> >
> > If there is a COALESCE expr with first arg being non-partition key expr
> > and second arg being partition key, the patch would match it to the
> > partition key, which may result in wrong results in some cases.
> >
> > For instance, consider the partition table below:
> >
> > create table p (k int, val int) partition by range(k);
> > create table p_1 partition of p for values from (1) to (10);
> > create table p_2 partition of p for values from (10) to (100);
> >
> > So with patch v2-0002, the following query will be planned with
> > partitionwise join.
> >
> > # explain (costs off)
> > select * from (p as t1 full join p as t2 on t1.k = t2.k) as
> t12(k1,val1,k2,val2)
> > full join p as t3 on COALESCE(t12.val1,
> t12.k1) = t3.k;
> > QUERY PLAN
> > --
> >  Append
> >->  Hash Full Join
> >  Hash Cond: (COALESCE(t1.val, t1.k) = t3.k)
> >  ->  Hash Full Join
> >Hash Cond: (t1.k = t2.k)
> >->  Seq Scan on p_1 t1
> >->  Hash
> >  ->  Seq Scan on p_1 t2
> >  ->  Hash
> >->  Seq Scan on p_1 t3
> >->  Hash Full Join
> >  Hash Cond: (COALESCE(t1_1.val, t1_1.k) = t3_1.k)
> >  ->  Hash Full Join
> >Hash Cond: (t1_1.k = t2_1.k)
> >->  Seq Scan on p_2 t1_1
> >->  Hash
> >  ->  Seq Scan on p_2 t2_1
> >  ->  Hash
> >->  Seq Scan on p_2 t3_1
> > (19 rows)
> >
> > But as t1.val is not a partition key, actually we cannot use
> > partitionwise join here.
> >
> > If we insert below data into the table, we will get wrong results for
> > the query above.
> >
> > insert into p select 5,15;
> > insert into p select 15,5;
>
> Good catch!  It's quite wrong to use COALESCE(t12.val1, t12.k1) = t3.k
> for partitionwise join as the COALESCE expression might as well output
> the value of val1 which doesn't conform to partitioning.
>
> I've fixed match_join_arg_to_partition_keys() to catch that case and
> fail.  Added a test case as well.
>
> Please find attached updated patches.
>

Thank you for the fix. Will review.

Thanks
Richard


Re: Avoiding possible future conformance headaches in JSON work

2019-09-20 Thread Peter Eisentraut
On 2019-09-20 01:14, Tom Lane wrote:
> Chapman Flack  writes:
>> Sure, against *every* non-spec feature we have or ever will have, someone
>> /could/ raise a generic "what if SQL committee might add something pretty
>> similar in future".
>> But what we have in this case are specific non-spec features (array, map,
>> and sequence constructors, lambdas, map/fold/reduce, user-defined
>> functions) that are flat-out already present in the current version of
>> the language that the SQL committee is clearly modeling jsonpath on.
> 
> Sure.  But we also modeled those features on the same language that the
> committee is looking at (or at least I sure hope we did).  So it's
> reasonable to assume that they would come out at the same spot without
> any prompting.  And we can prompt them ;-).

Also, I understand these are features proposed for PG13, not in PG12.
So while this is an important discussion, it's not relevant to the PG12
release, right?

(If so, I'm content to close these open items.)

-- 
Peter Eisentraut  http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: pgbench - allow to create partitioned tables

2019-09-20 Thread Amit Kapila
On Fri, Sep 20, 2019 at 10:29 AM Fabien COELHO  wrote:
>
> >> The behavior is not actually changed, but I had to move fillfactor away
> >> because it cannot be declared on partitioned tables, it must be declared
> >> on partitions only. Once there is a function to handle that it is pretty
> >> easy to add the test.
> >>
> >> I can remove it but franckly there are only benefits: the default is now
> >> tested by pgbench, the create query is smaller, and it would work with
> >> older versions of pg, which does not matter but is good on principle.
> >
> > I am not saying that it is a bad check on its own, rather it might be
> > good, but let's not do any unrelated change as that will delay the
> > main patch.  Once, we are done with the main patch, you can propose
> > these as improvements.
>
> I would not bother to create a patch for so small an improvement. This
> makes sense in passing because the created function makes it very easy,
> but otherwise I'll just drop it.
>

I would prefer to drop for now.

> >> The user could do a -i with a version of pgbench and bench with another
> >> one. I do that often while developing…
> >
> > I am not following what you want to say here especially ("pgbench and
> > bench with another one").  Can you explain with some example?
>
> While developing, I often run pgbench under development client against an
> already created set of tables on an already created cluster, and usually
> the server side on my laptop is the last major release from pgdg (ie 11.5)
> while the pgbench I'm testing is from sources (ie 12dev). If I type
> "pgbench" I run 11.5, and in the sources "./pgbench" runs the dev version.
>

Hmm, I think some such thing is possible when you are running a lower
version of pgbench against tables initialized by a higher version of
pgbench, because a higher-version pgbench must be a superset of the
lower version unless we drop support for one of the partitioning
methods.  Even if there is some unknown partition method, I think it
should be detected much earlier rather than only at the stage of
printing the results, as happens after the query for partitions in the
code below.

+ else
+ {
+ fprintf(stderr, "unexpected partition method: \"%s\"\n", ps);
+ exit(1);
+ }

If we can't catch that earlier, then it might be better to have some
version-specific checks rather than such obscure code which is
difficult to understand for others.
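
For example (a rough sketch, not code from the patch), the simplest such
check would reject the option outright on servers that cannot have
partitioned tables at all, instead of trying to interpret an unexpected
partition method much later:

    /* sketch only: partitioned tables exist since server version 10 */
    if (partitions > 0 && PQserverVersion(con) < 100000)
    {
        fprintf(stderr, "--partitions requires PostgreSQL 10 or newer\n");
        exit(1);
    }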

I have made a few modifications in the attached patch.
* move the create partitions related code into a separate function.
* make the check related to the number of partitions consistent, i.e.
check partitions > 0 everywhere apart from where we print it (which I
also want to change, but let us first discuss one of the above points)
* when we don't find the pgbench_accounts table, error out instead of continuing
* ensure append_fillfactor doesn't assume that it has to append a
fillfactor, and remove the fillfactor < 100 check from it.
* improve the comments around query to fetch partitions
* improve the comments in the patch and make the code look like nearby code
* pgindent the patch

I think we should try to add some note or comment explaining why we only
choose to partition the pgbench_accounts table when the user has given
the --partitions option.

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com


pgbench-init-partitioned-13.patch
Description: Binary data


Re: [WIP] UNNEST(REFCURSOR): allowing SELECT to consume data from a REFCURSOR

2019-09-20 Thread Roman Pekar
Hi John,

Thanks for pushing this; to me it looks like a promising start! I need a bit
more time to go through the code (and I'm not an expert in Postgres
internals in any way), but I really appreciate you doing this.

Roman


Re: [PATCH] src/test/modules/dummy_index -- way to test reloptions from inside of access method

2019-09-20 Thread Michael Paquier
On Fri, Sep 20, 2019 at 09:16:58AM +0900, Michael Paquier wrote:
> On Thu, Sep 19, 2019 at 02:13:23PM +0300, Nikolay Shaplov wrote:
>> What a good catch! dummy_index already proved to be useful ;-)
> 
> Yes, the testing around custom reloptions is rather poor now, so this
> module has value.  I still don't like much its shape though, so I
> began hacking on it for integration, and I wanted to get that part
> done in this CF :)

So...  I have looked at the patch of upthread in details, and as I
suspected the module is over-designed.  First, on HEAD the coverage of
reloptions.c is 86.6%, with your patch we get at 94.1%, and with the
attached I reach 95.1% thanks to the addition of a string parameter
with a NULL default value and a NULL description, for roughly half the
code size.

The GUCs are also basically not necessary, as you can just replace the
various WARNING calls (please don't call elog on anything which can be
reached by the user!) with lookups at reloptions in pg_class.  Once this
is removed, the whole code gets simpler, there is no point in having
either a separate header or a different set of files, and the size of
the whole module shrinks considerably.

I still need to do an extra pass on the code (particularly the AM
part), but I think that we could commit that.  Please note that I
included the fix for the lockmode I sent today so as the patch can be
tested:
https://www.postgresql.org/message-id/20190920013831.gd1...@paquier.xyz

Thoughts?
--
Michael
diff --git a/contrib/bloom/expected/bloom.out b/contrib/bloom/expected/bloom.out
index 5ab9e34f82..dae12a7d3e 100644
--- a/contrib/bloom/expected/bloom.out
+++ b/contrib/bloom/expected/bloom.out
@@ -5,6 +5,7 @@ CREATE TABLE tst (
 );
 INSERT INTO tst SELECT i%10, substr(md5(i::text), 1, 1) FROM generate_series(1,2000) i;
 CREATE INDEX bloomidx ON tst USING bloom (i, t) WITH (col1 = 3);
+ALTER INDEX bloomidx SET (length=80);
 SET enable_seqscan=on;
 SET enable_bitmapscan=off;
 SET enable_indexscan=off;
diff --git a/contrib/bloom/sql/bloom.sql b/contrib/bloom/sql/bloom.sql
index 32755f2b1a..4733e1e705 100644
--- a/contrib/bloom/sql/bloom.sql
+++ b/contrib/bloom/sql/bloom.sql
@@ -7,6 +7,7 @@ CREATE TABLE tst (
 
 INSERT INTO tst SELECT i%10, substr(md5(i::text), 1, 1) FROM generate_series(1,2000) i;
 CREATE INDEX bloomidx ON tst USING bloom (i, t) WITH (col1 = 3);
+ALTER INDEX bloomidx SET (length=80);
 
 SET enable_seqscan=on;
 SET enable_bitmapscan=off;
diff --git a/src/backend/access/common/reloptions.c b/src/backend/access/common/reloptions.c
index 20f4ed3c38..b59e606771 100644
--- a/src/backend/access/common/reloptions.c
+++ b/src/backend/access/common/reloptions.c
@@ -659,6 +659,13 @@ allocate_reloption(bits32 kinds, int type, const char *name, const char *desc)
 	newoption->namelen = strlen(name);
 	newoption->type = type;
 
+	/*
+	 * Set the default lock mode for this option.  There is no actual way
+	 * for a module to enforce it when declaring a custom relation option,
+	 * so just use the highest level, which is safe for all cases.
+	 */
+	newoption->lockmode = AccessExclusiveLock;
+
 	MemoryContextSwitchTo(oldcxt);
 
 	return newoption;
diff --git a/src/test/modules/Makefile b/src/test/modules/Makefile
index 0e4e53d63e..b2eaef3bff 100644
--- a/src/test/modules/Makefile
+++ b/src/test/modules/Makefile
@@ -7,6 +7,7 @@ include $(top_builddir)/src/Makefile.global
 SUBDIRS = \
 		  brin \
 		  commit_ts \
+		  dummy_index_am \
 		  dummy_seclabel \
 		  snapshot_too_old \
 		  test_bloomfilter \
diff --git a/src/test/modules/dummy_index_am/.gitignore b/src/test/modules/dummy_index_am/.gitignore
new file mode 100644
index 00..5dcb3ff972
--- /dev/null
+++ b/src/test/modules/dummy_index_am/.gitignore
@@ -0,0 +1,4 @@
+# Generated subdirectories
+/log/
+/results/
+/tmp_check/
diff --git a/src/test/modules/dummy_index_am/Makefile b/src/test/modules/dummy_index_am/Makefile
new file mode 100644
index 00..b7e09b1469
--- /dev/null
+++ b/src/test/modules/dummy_index_am/Makefile
@@ -0,0 +1,20 @@
+# src/test/modules/dummy_index_am/Makefile
+
+MODULES = dummy_index_am
+
+EXTENSION = dummy_index_am
+DATA = dummy_index_am--1.0.sql
+PGFILEDESC = "dummy_index_am - minimized index access method"
+
+REGRESS = reloptions
+
+ifdef USE_PGXS
+PG_CONFIG = pg_config
+PGXS := $(shell $(PG_CONFIG) --pgxs)
+include $(PGXS)
+else
+subdir = src/test/modules/dummy_index_am
+top_builddir = ../../../..
+include $(top_builddir)/src/Makefile.global
+include $(top_srcdir)/contrib/contrib-global.mk
+endif
diff --git a/src/test/modules/dummy_index_am/README b/src/test/modules/dummy_index_am/README
new file mode 100644
index 00..5563d060ac
--- /dev/null
+++ b/src/test/modules/dummy_index_am/README
@@ -0,0 +1,11 @@
+Dummy Index AM
+==
+
+Dummy index is an index module for testing any facility usable by an index
+access method, whose code is kept a maximum simple.
+
+This includes tests for all relation option types:
+- boolean
+- integer
+- r

Re: standby recovery fails (tablespace related) (tentative patch and discussion)

2019-09-20 Thread Asim R P
On Thu, Sep 19, 2019 at 5:29 PM Asim R P  wrote:
>
> In order to fix the test failures, we need to distinguish between a
missing database directory and a missing tablespace directory.  And also
add logic to forget missing directories during tablespace drop.  I am
working on it.

Please find attached a solution that builds on what Paul has proposed.  A
hash table, similar to the invalid-page hash table, is used to track missing
directory references.  A missing directory may be a tablespace or a
database, depending on whether the tablespace or the source database is
found to be missing.  Crash recovery succeeds if the hash table
is empty at the end.
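
For illustration only (the names below are hypothetical and not taken
from the attached patches), an entry in that hash table could look
roughly like this, keyed on the tablespace and database OIDs:

    #include "postgres.h"

    /*
     * Hypothetical sketch: one entry per missing directory.  dbNode is
     * InvalidOid when the tablespace directory itself is what is missing.
     */
    typedef struct missing_dir_key
    {
        Oid         spcNode;        /* tablespace OID */
        Oid         dbNode;         /* database OID, or InvalidOid */
    } missing_dir_key;

    typedef struct missing_dir_entry
    {
        missing_dir_key key;
        char        path[MAXPGPATH];    /* reported if still missing at end of recovery */
    } missing_dir_entry;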

Asim


v6-0001-Support-node-initialization-from-backup-with-tabl.patch
Description: Binary data


v6-0002-Tests-to-replay-create-database-operation-on-stan.patch
Description: Binary data


v6-0003-Fix-replay-of-create-database-records-on-standby.patch
Description: Binary data


WAL recycled despite logical replication slot

2019-09-20 Thread Jeff Janes
While testing something else (whether "terminating walsender process due to
replication timeout" was happening spuriously), I had logical replication
set up streaming a default pgbench transaction load, with the publisher
being 13devel-e1c8743 and subscriber being 12BETA4.  Eventually I started
getting errors about requested wal segments being already removed:

10863 sub idle 0 2019-09-19 17:14:58.140 EDT LOG:  starting logical
decoding for slot "sub"
10863 sub idle 0 2019-09-19 17:14:58.140 EDT DETAIL:  Streaming
transactions committing after 79/EB0B17A0, reading WAL from 79/E70736A0.
10863 sub idle 58P01 2019-09-19 17:14:58.140 EDT ERROR:  requested WAL
segment 0001007900E7 has already been removed
10863 sub idle 0 2019-09-19 17:14:58.144 EDT LOG:  disconnection:
session time: 0:00:00.030 user=jjanes database=jjanes host=10.0.2.2
port=40830

It had been streaming for about 50 minutes before the error showed up, and
it appeared right when streaming was restarting after one of the replication
timeouts.

Is there an innocent explanation for this?  I thought logical replication
slots provided an iron-clad guarantee that WAL would be retained until it
was no longer needed.  I am just using pub/sub, none of the lower level
stuff.

Cheers,

Jeff


Re: backup manifests

2019-09-20 Thread Robert Haas
On Thu, Sep 19, 2019 at 11:10 PM David Steele  wrote:
> Overall we have found it's much simpler to label each backup and
> cross-check that against the pg version and system id.  Start LSN is
> pretty unique, but backup labels work really well and are more widely
> understood.

I see your point, but part of my point is that uniqueness is not a
technical requirement. However, it may be a requirement for user
comprehension.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: Conflict handling for COPY FROM

2019-09-20 Thread Alexey Kondratov

Hi Surafel,

On 16.07.2019 10:08, Surafel Temesgen wrote:
> i also add an option to ignore all errors in ERROR set to -1

Great!


The patch still applies cleanly (tested on e1c8743e6c), but I've got
some problems using more elaborate tests.


First of all, there is definitely a problem with the grammar. In the docs,
ERROR is defined as an option, and


COPY test FROM '/path/to/copy-test-simple.csv' ERROR -1;

works just fine, but if modern 'WITH (...)' syntax is used:

COPY test FROM '/path/to/copy-test-simple.csv' WITH (ERROR -1);
ERROR:  option "error" not recognized

while 'WITH (error_limit -1)' it works again.

This happens because COPY supports both the modern and the very old syntax:

* In the preferred syntax the options are comma-separated
* and use generic identifiers instead of keywords.  The pre-9.0
* syntax had a hard-wired, space-separated set of options.

So I see several options here:

1) Everything is left as is, but then the docs should be updated to
reflect that error_limit is required for the modern syntax.


2) However, why do we have to support the old syntax here? I guess it exists
for backward compatibility only, but this is a completely new feature.
So maybe just 'WITH (error_limit 42)' will be enough?


3) You may also simply change the internal option name from 'error_limit' to
'error', or the SQL keyword from 'ERROR' to 'ERROR_LIMIT'.


I would prefer the second option.


Next, you use DestRemoteSimple for returning conflicting tuples back:

+        dest = CreateDestReceiver(DestRemoteSimple);
+        dest->rStartup(dest, (int) CMD_SELECT, tupDesc);

However, printsimple supports only a very limited subset of built-in types, so

CREATE TABLE large_test (id integer primary key, num1 bigint, num2 
double precision);

COPY large_test FROM '/path/to/copy-test.tsv';
COPY large_test FROM '/path/to/copy-test.tsv' ERROR 3;

fails with the following error, 'ERROR:  unsupported type OID: 701', which
seems very confusing from the end-user perspective. I've tried to
switch to DestRemote, but couldn't figure it out quickly.



Finally, I simply cannot get into this validation:

+        else if (strcmp(defel->defname, "error_limit") == 0)
+        {
+            if (cstate->ignore_error)
+                ereport(ERROR,
+                        (errcode(ERRCODE_SYNTAX_ERROR),
+                         errmsg("conflicting or redundant options"),
+                         parser_errposition(pstate, defel->location)));
+            cstate->error_limit = defGetInt64(defel);
+            cstate->ignore_error = true;
+            if (cstate->error_limit == -1)
+                cstate->ignore_all_error = true;
+        }

If cstate->ignore_error is set, then we have already processed the
options list, since this is the only place where it's set. So we
should never get into this ereport, should we?



Regards

--
Alexey Kondratov

Postgres Professional https://www.postgrespro.com
Russian Postgres Company





Re: backup manifests

2019-09-20 Thread Robert Haas
On Thu, Sep 19, 2019 at 11:06 PM David Steele  wrote:
> > I am not crazy about JSON because it requires that I get a json parser
> > into src/common, which I could do, but given the possibly-imminent end
> > of the universe, I'm not sure it's the greatest use of time. You're
> > right that if we pick an ad-hoc format, we've got to worry about
> > escaping, which isn't lovely.
>
> My experience is that JSON is simple to implement and has already dealt
> with escaping and data structure considerations.  A home-grown solution
> will be at least as complex but have the disadvantage of being non-standard.

I think that's fair and just spent a little while investigating how
difficult it would be to disentangle the JSON parser from the backend.
It has dependencies on the following bits of backend-only
functionality:

- check_stack_depth(). No problem, I think.  Just skip it for frontend code.

- pg_mblen() / GetDatabaseEncoding(). Not sure what to do about this.
Some of our infrastructure for dealing with encoding is available in
the frontend and backend, but this part is backend-only.

- elog() / ereport(). Kind of a pain. We could just kill the program
if an error occurs, but that seems a bit ham-fisted. Refactoring the
code so that the error is returned rather than thrown might be the way
to go, but it's not simple, because you're not just passing a string.

    ereport(ERROR,
            (errcode(ERRCODE_INVALID_TEXT_REPRESENTATION),
             errmsg("invalid input syntax for type %s", "json"),
             errdetail("Character with value 0x%02x must be escaped.",
                       (unsigned char) *s),
             report_json_context(lex)));
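
One possible shape (purely illustrative; the names here are invented,
not an existing PostgreSQL structure) would be to have the parser fill
in a small error descriptor and let the caller decide whether to
ereport() it in the backend or print it in a frontend program:

    typedef struct JsonParseError
    {
        int         sqlerrcode;     /* e.g. ERRCODE_INVALID_TEXT_REPRESENTATION */
        char        message[128];   /* primary message, already formatted */
        char        detail[128];    /* e.g. which character must be escaped */
        int         error_offset;   /* input location, replacing report_json_context() */
    } JsonParseError;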

- appendStringInfo et. al. I don't think it would be that hard to move
this to src/common, but I'm also not sure it really solves the
problem, because StringInfo has a 1GB limit, and there's no rule at
all that a backup manifest has got to be less than 1GB.

https://www.pgcon.org/2013/schedule/events/595.en.html

This gets at another problem that I just started to think about. If
the file is just a series of lines, you can parse it one line and a
time and do something with that line, then move on. If it's a JSON
blob, you have to parse the whole file and get a potentially giant
data structure back, and then operate on that data structure. At
least, I think you do. There's probably some way to create a callback
structure that lets you presuppose that the toplevel data structure is
an array (or object) and get back each element of that array (or
key/value pair) as it's parsed, but that sounds pretty annoying to get
working. Or we could just decide that you have to have enough memory
to hold the parsed version of the entire manifest file in memory all
at once, and if you don't, maybe you should drop some tables or buy
more RAM. That still leaves you with bypassing the 1GB size limit on
StringInfo, maybe by having a "huge" option, or perhaps by
memory-mapping the file and then making the StringInfo point directly
into the mapped region. Perhaps I'm overthinking this and maybe you
have a simpler idea in mind about how it can be made to work, but I
find all this complexity pretty unappealing.

Here's a competing proposal: let's decide that lines consist of
tab-separated fields. If a field contains a \t, \r, or \n, put a " at
the beginning, a " at the end, and double any " that appears in the
middle. This is easy to generate and easy to parse. It lets us
completely ignore encoding considerations. Incremental parsing is
straightforward. Quoting will rarely be needed because there's very
little reason to create a file inside a PostgreSQL data directory that
contains a tab or a newline, but if you do it'll still work.  The lack
of quoting is nice for humans reading the manifest, and nice in terms
of keeping the manifest succinct; in contrast, note that using JSON
doubles every backslash.
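
To make that concrete, here is a rough sketch of the writer side (a
hypothetical helper, not taken from any patch):

    #include <stdio.h>
    #include <string.h>

    /*
     * Quote a field only if it contains \t, \r, or \n; inside a quoted
     * field, double any embedded double quote.
     */
    static void
    write_manifest_field(FILE *out, const char *field)
    {
        if (strpbrk(field, "\t\r\n") == NULL)
        {
            fputs(field, out);
            return;
        }

        fputc('"', out);
        for (const char *p = field; *p != '\0'; p++)
        {
            if (*p == '"')
                fputc('"', out);    /* double embedded quotes */
            fputc(*p, out);
        }
        fputc('"', out);
    }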

I hear you saying that this is going to end up being just as complex
in the end, but I don't think I believe it.  It sounds to me like the
difference between spending a couple of hours figuring this out and
spending a couple of months trying to figure it out and maybe not
actually getting anywhere.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: psql - add SHOW_ALL_RESULTS option

2019-09-20 Thread Fabien COELHO



> The below addition can be removed, it seems to be duplicate:

Indeed. I'm unsure how this got into the patch, probably some rebase
mix-up. Attached v7 removes the duplicates.


--
Fabien.

diff --git a/contrib/pg_stat_statements/expected/pg_stat_statements.out b/contrib/pg_stat_statements/expected/pg_stat_statements.out
index 6787ec1efd..de59a5cf65 100644
--- a/contrib/pg_stat_statements/expected/pg_stat_statements.out
+++ b/contrib/pg_stat_statements/expected/pg_stat_statements.out
@@ -49,17 +49,42 @@ BEGIN \;
 SELECT 2.0 AS "float" \;
 SELECT 'world' AS "text" \;
 COMMIT;
+ float 
+---
+   2.0
+(1 row)
+
+ text  
+---
+ world
+(1 row)
+
 -- compound with empty statements and spurious leading spacing
-\;\;   SELECT 3 + 3 \;\;\;   SELECT ' ' || ' !' \;\;   SELECT 1 + 4 \;;
- ?column? 
---
-5
+\;\;   SELECT 3 + 3 AS "+" \;\;\;   SELECT ' ' || ' !' AS "||" \;\;   SELECT 1 + 4 AS "+" \;;
+ + 
+---
+ 6
+(1 row)
+
+ ||  
+-
+   !
+(1 row)
+
+ + 
+---
+ 5
 (1 row)
 
 -- non ;-terminated statements
 SELECT 1 + 1 + 1 AS "add" \gset
 SELECT :add + 1 + 1 AS "add" \;
 SELECT :add + 1 + 1 AS "add" \gset
+ add 
+-
+   5
+(1 row)
+
 -- set operator
 SELECT 1 AS i UNION SELECT 2 ORDER BY i;
  i 
@@ -102,12 +127,12 @@ SELECT query, calls, rows FROM pg_stat_statements ORDER BY query COLLATE "C";
  SELECT $1+| 4 |4
   +|   | 
AS "text"   |   | 
- SELECT $1 + $2| 2 |2
  SELECT $1 + $2 + $3 AS "add"  | 3 |3
+ SELECT $1 + $2 AS "+" | 2 |2
  SELECT $1 AS "float"  | 1 |1
  SELECT $1 AS "int"| 2 |2
  SELECT $1 AS i UNION SELECT $2 ORDER BY i | 1 |2
- SELECT $1 || $2   | 1 |1
+ SELECT $1 || $2 AS "||"   | 1 |1
  SELECT pg_stat_statements_reset() | 1 |1
  WITH t(f) AS (   +| 1 |2
VALUES ($1), ($2)  +|   | 
diff --git a/contrib/pg_stat_statements/sql/pg_stat_statements.sql b/contrib/pg_stat_statements/sql/pg_stat_statements.sql
index 8b527070d4..ea3a31176e 100644
--- a/contrib/pg_stat_statements/sql/pg_stat_statements.sql
+++ b/contrib/pg_stat_statements/sql/pg_stat_statements.sql
@@ -27,7 +27,7 @@ SELECT 'world' AS "text" \;
 COMMIT;
 
 -- compound with empty statements and spurious leading spacing
-\;\;   SELECT 3 + 3 \;\;\;   SELECT ' ' || ' !' \;\;   SELECT 1 + 4 \;;
+\;\;   SELECT 3 + 3 AS "+" \;\;\;   SELECT ' ' || ' !' AS "||" \;\;   SELECT 1 + 4 AS "+" \;;
 
 -- non ;-terminated statements
 SELECT 1 + 1 + 1 AS "add" \gset
diff --git a/doc/src/sgml/ref/psql-ref.sgml b/doc/src/sgml/ref/psql-ref.sgml
index 7789fc6177..4e6ab5a0a5 100644
--- a/doc/src/sgml/ref/psql-ref.sgml
+++ b/doc/src/sgml/ref/psql-ref.sgml
@@ -127,18 +127,11 @@ echo '\x \\ SELECT * FROM foo;' | psql
commands included in the string to divide it into multiple
transactions.  (See 
for more details about how the server handles multi-query strings.)
-   Also, psql only prints the
-   result of the last SQL command in the string.
-   This is different from the behavior when the same string is read from
-   a file or fed to psql's standard input,
-   because then psql sends
-   each SQL command separately.
   
   
-   Because of this behavior, putting more than one SQL command in a
-   single -c string often has unexpected results.
-   It's better to use repeated -c commands or feed
-   multiple commands to psql's standard input,
+   If having several commands executed in one transaction is not desired, 
+   use repeated -c commands or feed multiple commands to
+   psql's standard input,
either using echo as illustrated above, or
via a shell here-document, for example:
 
@@ -3355,10 +3348,8 @@ select 1\; select 2\; select 3;
 commands included in the string to divide it into multiple
 transactions.  (See 
 for more details about how the server handles multi-query strings.)
-psql prints only the last query result
-it receives for each request; in this example, although all
-three SELECTs are indeed executed, psql
-only prints the 3.
+psql prints all results it receives, one
+after the other.
 
 
   
diff --git a/src/bin/psql/common.c b/src/bin/psql/common.c
index 4b2679360f..db6d031133 100644
--- a/src/bin/psql/common.c
+++ b/src/bin/psql/common.c
@@ -486,6 +486,19 @@ ResetCancelConn(void)
 #endif
 }
 
+/*
+ * Show error message from result, if any, and check connection in passing.
+ */
+static void
+ShowErrorMessage(const PGresult *resul

Re: backup manifests

2019-09-20 Thread Robert Haas
On Thu, Sep 19, 2019 at 11:06 PM David Steele  wrote:
> > I don't think it's a good idea to duplicate the information that's
> > already in the backup_label. Storing two copies of the same
> > information is just an invitation to having to worry about what
> > happens if they don't agree.
>
> OK, but now we have backup_label, tablespace_map,
> ..backup (in the WAL) and now perhaps a
> backup.manifest file.  I feel like we may be drowning in backup info files.

I agree!

I'm not sure what to do about it, though.  The information that is
present in the tablespace_map file could have been stored in the
backup_label file, I think, and that would have made sense, because
both files are serving a very similar purpose: they tell the server
that it needs to do some non-standard stuff when it starts up, and
they give it instructions for what those things are. And, as a
secondary purpose, humans or third-party tools can read them and use
that information for whatever purpose they wish.

The proposed backup_manifest file is a little different. I don't think
that anyone is proposing that the server should read that file: it is
there solely for the purpose of helping our own tools or third-party
tools or human beings who are, uh, acting like tools.[1] We're also
proposing to put it in a different place: the backup_label goes into
one of the tar files, but the backup_manifest would sit outside of any
tar file.

If we were designing this from scratch, maybe we'd roll all of this
into one file that serves as backup manifest, tablespace map, backup
label, and backup history file, but then again, maybe separating the
instructions-to-the-server part from the backup-integrity-checking
part makes sense.  At any rate, even if we knew for sure that's the
direction we wanted to go, getting there from here looks a bit rough.
If we just add a backup manifest, people who don't care can mostly
ignore it and then should be mostly fine. If we start trying to create
the one backup information system to rule them all, we're going to
break people's tools. Maybe that's worth doing someday, but the paint
isn't even dry on removing recovery.conf yet.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

[1] There are a surprising number of installations where, in effect,
the DBA is the backup-and-restore tool, performing all the steps by
hand and hoping not to mess any of them up. The fact that nearly every
PostgreSQL company offers tools to make this easier does not seem to
have done a whole lot to diminish the number of people using ad-hoc
solutions.




Re: Usage of the system truststore for SSL certificate validation

2019-09-20 Thread Isaac Morland
On Thu, 19 Sep 2019 at 12:26, Isaac Morland  wrote:

> If we're going to open this up, can we add an option to say "this key is
> allowed to log in to this account", SSH style?
>
> I like the idea of using keys rather than .pgpass, but I like the
> ~/.ssh/authorized_keys model and don't like the "set up an entire
> certificate infrastructure" approach.
>

 Sorry for the top-post.


Re: backup manifests

2019-09-20 Thread Robert Haas
On Fri, Sep 20, 2019 at 9:46 AM Robert Haas  wrote:
> - appendStringInfo et. al. I don't think it would be that hard to move
> this to src/common, but I'm also not sure it really solves the
> problem, because StringInfo has a 1GB limit, and there's no rule at
> all that a backup manifest has got to be less than 1GB.

Hmm.  That's actually going to be a problem on the server side, no
matter what we do on the client side.  We have to send the manifest
after we send everything else, so that we know what we sent. But if we
sent a lot of files, the manifest might be really huge. I had been
thinking that we would generate the manifest on the server and send it
to the client after everything else, but maybe this is an argument for
generating the manifest on the client side and writing it
incrementally. That would require the client to peek at the contents
of every tar file it receives all the time, which it currently doesn't
need to do, but it does peek inside them a little bit, so maybe it's
OK.

Another alternative would be to have the server spill the manifest in
progress to a temp file and then stream it from there to the client.

Thoughts?

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: backup manifests

2019-09-20 Thread David Steele
On 9/20/19 9:46 AM, Robert Haas wrote:
> On Thu, Sep 19, 2019 at 11:06 PM David Steele  wrote:
>
>> My experience is that JSON is simple to implement and has already dealt
>> with escaping and data structure considerations.  A home-grown solution
>> will be at least as complex but have the disadvantage of being non-standard.
>
> I think that's fair and just spent a little while investigating how
> difficult it would be to disentangle the JSON parser from the backend.
> It has dependencies on the following bits of backend-only
> functionality:

> - elog() / ereport(). Kind of a pain. We could just kill the program
> if an error occurs, but that seems a bit ham-fisted. Refactoring the
> code so that the error is returned rather than thrown might be the way
> to go, but it's not simple, because you're not just passing a string.

Seems to me we are overdue for elog()/ereport() compatible
error-handling in the front end.  Plus mem contexts.

It sucks to make that a prereq for this project but the longer we kick
that can down the road...

> https://www.pgcon.org/2013/schedule/events/595.en.html

This talk was good fun.  The largest number of tables we've seen is a
few hundred thousand, but that still adds up to more than a million
files to backup.

> This gets at another problem that I just started to think about. If
> the file is just a series of lines, you can parse it one line and a
> time and do something with that line, then move on. If it's a JSON
> blob, you have to parse the whole file and get a potentially giant
> data structure back, and then operate on that data structure. At
> least, I think you do. 

JSON can definitely be parsed incrementally, but for practical reasons
certain structures work better than others.

> There's probably some way to create a callback
> structure that lets you presuppose that the toplevel data structure is
> an array (or object) and get back each element of that array (or
> key/value pair) as it's parsed, but that sounds pretty annoying to get
> working. 

And that's how we do it.  It's annoying and yeah it's complicated but it
is very fast and memory-efficient.
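
Roughly, the shape is a struct of callbacks that the parser invokes as
it scans, so only one element is ever materialized at a time.  A purely
illustrative sketch (these names are made up, not our actual API):

    #include <stdio.h>

    typedef struct ManifestParseCallbacks
    {
        void       *state;          /* caller-supplied context */
        void        (*array_start) (void *state);
        void        (*array_end) (void *state);
        void        (*element_field) (void *state, const char *key, const char *value);
        void        (*error) (void *state, const char *message);
    } ManifestParseCallbacks;

    /* The parser reads from "in" and fires the callbacks as it goes. */
    extern void parse_manifest_stream(FILE *in, const ManifestParseCallbacks *cb);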

> Or we could just decide that you have to have enough memory
> to hold the parsed version of the entire manifest file in memory all
> at once, and if you don't, maybe you should drop some tables or buy
> more RAM. 

I assume you meant "un-parsed" here?

> That still leaves you with bypassing the 1GB size limit on
> StringInfo, maybe by having a "huge" option, or perhaps by
> memory-mapping the file and then making the StringInfo point directly
> into the mapped region. Perhaps I'm overthinking this and maybe you
> have a simpler idea in mind about how it can be made to work, but I
> find all this complexity pretty unappealing.

Our String object has the same 1GB limit.  Partly because it works and
saves a bit of memory per object, but also because if we find ourselves
exceeding that limit we know we've probably made a design error.

Parsing in stream means that you only need to store the final in-memory
representation of the manifest which can be much more compact.  Yeah,
it's complicated, but the memory and time savings are worth it.

Note that our Perl implementation took the naive approach and has worked
pretty well for six years, but can choke on really large manifests with
out of memory errors.  Overall, I'd say getting the format right is more
important than having the perfect initial implementation.

> Here's a competing proposal: let's decide that lines consist of
> tab-separated fields. If a field contains a \t, \r, or \n, put a " at
> the beginning, a " at the end, and double any " that appears in the
> middle. This is easy to generate and easy to parse. It lets us
> completely ignore encoding considerations. Incremental parsing is
> straightforward. Quoting will rarely be needed because there's very
> little reason to create a file inside a PostgreSQL data directory that
> contains a tab or a newline, but if you do it'll still work.  The lack
> of quoting is nice for humans reading the manifest, and nice in terms
> of keeping the manifest succinct; in contrast, note that using JSON
> doubles every backslash.

There's other information you'll want to store that is not strictly file
info so you need a way to denote that.  It gets complicated quickly.

> I hear you saying that this is going to end up being just as complex
> in the end, but I don't think I believe it.  It sounds to me like the
> difference between spending a couple of hours figuring this out and
> spending a couple of months trying to figure it out and maybe not
> actually getting anywhere.

Maybe the initial implementation will be easier but I am confident we'll
pay for it down the road.  Also, don't we want users to be able to read
this file?  Do we really want them to need to cook up a custom parser in
Perl, Go, Python, etc.?

-- 
-David
da...@pgmasters.net




Re: Global temporary tables

2019-09-20 Thread Konstantin Knizhnik
I have added support for all index types (brin, btree, gin, gist, hash,
spgist) for global temp tables (before, only B-tree indexes were supported).
It would be nice to have some generic mechanism for this, but I do not
see what it could look like.
The problem is that normal relations are initialized at the moment of
their creation, but for global temp relations the metadata already exists
while the data is absent. We should somehow catch access to such an
uninitialized page (not all pages, just the first page of the relation)
and perform initialization on demand.

New patch for global temp tables with shared buffers is attached.

--
Konstantin Knizhnik
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company

diff --git a/contrib/pg_buffercache/pg_buffercache_pages.c b/contrib/pg_buffercache/pg_buffercache_pages.c
index 1bd579f..2d93f6f 100644
--- a/contrib/pg_buffercache/pg_buffercache_pages.c
+++ b/contrib/pg_buffercache/pg_buffercache_pages.c
@@ -153,9 +153,9 @@ pg_buffercache_pages(PG_FUNCTION_ARGS)
 			buf_state = LockBufHdr(bufHdr);
 
 			fctx->record[i].bufferid = BufferDescriptorGetBuffer(bufHdr);
-			fctx->record[i].relfilenode = bufHdr->tag.rnode.relNode;
-			fctx->record[i].reltablespace = bufHdr->tag.rnode.spcNode;
-			fctx->record[i].reldatabase = bufHdr->tag.rnode.dbNode;
+			fctx->record[i].relfilenode = bufHdr->tag.rnode.node.relNode;
+			fctx->record[i].reltablespace = bufHdr->tag.rnode.node.spcNode;
+			fctx->record[i].reldatabase = bufHdr->tag.rnode.node.dbNode;
 			fctx->record[i].forknum = bufHdr->tag.forkNum;
 			fctx->record[i].blocknum = bufHdr->tag.blockNum;
 			fctx->record[i].usagecount = BUF_STATE_GET_USAGECOUNT(buf_state);
diff --git a/contrib/pg_prewarm/autoprewarm.c b/contrib/pg_prewarm/autoprewarm.c
index 38ae240..8a04954 100644
--- a/contrib/pg_prewarm/autoprewarm.c
+++ b/contrib/pg_prewarm/autoprewarm.c
@@ -608,9 +608,9 @@ apw_dump_now(bool is_bgworker, bool dump_unlogged)
 		if (buf_state & BM_TAG_VALID &&
 			((buf_state & BM_PERMANENT) || dump_unlogged))
 		{
-			block_info_array[num_blocks].database = bufHdr->tag.rnode.dbNode;
-			block_info_array[num_blocks].tablespace = bufHdr->tag.rnode.spcNode;
-			block_info_array[num_blocks].filenode = bufHdr->tag.rnode.relNode;
+			block_info_array[num_blocks].database = bufHdr->tag.rnode.node.dbNode;
+			block_info_array[num_blocks].tablespace = bufHdr->tag.rnode.node.spcNode;
+			block_info_array[num_blocks].filenode = bufHdr->tag.rnode.node.relNode;
 			block_info_array[num_blocks].forknum = bufHdr->tag.forkNum;
 			block_info_array[num_blocks].blocknum = bufHdr->tag.blockNum;
 			++num_blocks;
diff --git a/contrib/pgrowlocks/pgrowlocks.c b/contrib/pgrowlocks/pgrowlocks.c
index a2c44a9..43b4c66 100644
--- a/contrib/pgrowlocks/pgrowlocks.c
+++ b/contrib/pgrowlocks/pgrowlocks.c
@@ -158,7 +158,8 @@ pgrowlocks(PG_FUNCTION_ARGS)
 		/* must hold a buffer lock to call HeapTupleSatisfiesUpdate */
 		LockBuffer(hscan->rs_cbuf, BUFFER_LOCK_SHARE);
 
-		htsu = HeapTupleSatisfiesUpdate(tuple,
+		htsu = HeapTupleSatisfiesUpdate(mydata->rel,
+		tuple,
 		GetCurrentCommandId(false),
 		hscan->rs_cbuf);
 		xmax = HeapTupleHeaderGetRawXmax(tuple->t_data);
diff --git a/contrib/pgstattuple/pgstattuple.c b/contrib/pgstattuple/pgstattuple.c
index 70af43e..9cce720 100644
--- a/contrib/pgstattuple/pgstattuple.c
+++ b/contrib/pgstattuple/pgstattuple.c
@@ -349,7 +349,7 @@ pgstat_heap(Relation rel, FunctionCallInfo fcinfo)
 		/* must hold a buffer lock to call HeapTupleSatisfiesVisibility */
 		LockBuffer(hscan->rs_cbuf, BUFFER_LOCK_SHARE);
 
-		if (HeapTupleSatisfiesVisibility(tuple, &SnapshotDirty, hscan->rs_cbuf))
+		if (HeapTupleSatisfiesVisibility(rel, tuple, &SnapshotDirty, hscan->rs_cbuf))
 		{
 			stat.tuple_len += tuple->t_len;
 			stat.tuple_count++;
diff --git a/src/backend/access/brin/brin_revmap.c b/src/backend/access/brin/brin_revmap.c
index e2bfbf8..97041a8 100644
--- a/src/backend/access/brin/brin_revmap.c
+++ b/src/backend/access/brin/brin_revmap.c
@@ -25,6 +25,7 @@
 #include "access/brin_revmap.h"
 #include "access/brin_tuple.h"
 #include "access/brin_xlog.h"
+#include "access/brin.h"
 #include "access/rmgr.h"
 #include "access/xloginsert.h"
 #include "miscadmin.h"
@@ -79,6 +80,11 @@ brinRevmapInitialize(Relation idxrel, BlockNumber *pagesPerRange,
 	meta = ReadBuffer(idxrel, BRIN_METAPAGE_BLKNO);
 	LockBuffer(meta, BUFFER_LOCK_SHARE);
 	page = BufferGetPage(meta);
+
+	if (GlobalTempRelationPageIsNotInitialized(idxrel, page))
+		brin_metapage_init(page, BrinGetPagesPerRange(idxrel),
+		   BRIN_CURRENT_VERSION);
+
 	TestForOldSnapshot(snapshot, idxrel, page);
 	metadata = (BrinMetaPageData *) PageGetContents(page);
 
diff --git a/src/backend/access/gin/ginfast.c b/src/backend/access/gin/ginfast.c
index 439a91b..8a6ac71 100644
--- a/src/backend/access/gin/ginfast.c
+++ b/src/backend/access/gin/ginfast.c
@@ -241,6 +241,16 @@ ginHeapTupleFastInsert(GinState *ginstate, GinTupleCollector *collect

Re: Option to dump foreign data in pg_dump

2019-09-20 Thread Luis Carril
Hello,
   thanks for the comments!


*I suggest the option to be just --foreign-data because if we make it
--include-foreign-data it's expected to have an --exclude-foreign-data option too.

Several pg_dump options have no counterpart, e.g. --enable-row-security does not
have a disable option (disabled being the default). Also, calling it --foreign-data
would make it sound similar to --table: by default all tables are dumped, but with
--table only the selected tables are dumped. In contrast, without
--include-foreign-data all foreign data is excluded, and only with the option is
some foreign data included.

*please add test case

I added test cases for the invalid inputs. I'll try to make a test case for
the actual dump of foreign data, but that requires more setup, because a 
functional fdw is needed there.

* + if (tdinfo->filtercond || tbinfo->relkind == RELKIND_FOREIGN_TABLE)

filter condition is not implemented completely yet, so the logic only works on
foreign tables; I think it's better to handle it separately

Note that there is another if condition that actually applies the
filter condition if provided; also, for a foreign table we need to do a COPY of a
SELECT instead of a COPY TO

* I don't understand the need for changing the SELECT query. We can use the same
SELECT query syntax for both regular tables and foreign tables

To which query do you refer? In the patch there are three queries: one retrieves
the foreign servers, another is the SELECT in the COPY, which now applies when
there is a filter condition or the table is a foreign table, and a third retrieves
the oid of a given foreign server.


> As you have specified required_argument in above:
> + {"include-foreign-data", required_argument, NULL, 11},
>
> The below check may not be required:
> + if (strcmp(optarg, "") == 0)
> + {
> + pg_log_error("empty string is not a valid pattern in 
> --include-foreign-data");
> + exit_nicely(1);
> + }

We need to keep this check to prevent the use of
'--include-foreign-data=', which would match all foreign servers; in
previous messages it was established that that behavior is too coarse.

>
> + if (foreign_servers_include_patterns.head != NULL)
> + {
> + expand_foreign_server_name_patterns(fout, &foreign_servers_include_patterns,
> + &foreign_servers_include_oids);
> + if (foreign_servers_include_oids.head == NULL)
> + fatal("no matching foreign servers were found");
> + }
> +
>
> The above check if (foreign_servers_include_oids.head == NULL) may not
> be required, as there is a check present inside
> expand_foreign_server_name_patterns to handle this error:
> +
> + res = ExecuteSqlQuery(fout, query->data, PGRES_TUPLES_OK);
> + if (PQntuples(res) == 0)
> + fatal("no matching foreign servers were found for pattern \"%s\"", 
> cell->val);
> +

Removed

>
> +static void
> +expand_foreign_server_name_patterns(Archive *fout,
> + SimpleStringList *patterns,
> + SimpleOidList *oids)
> +{
> + PQExpBuffer query;
> + PGresult   *res;
> + SimpleStringListCell *cell;
> + int i;
> +
> + if (patterns->head == NULL)
> + return; /* nothing to do */
> +
>
> The above check for patterns->head may not be required as similar
> check exists before this function is called:
> + if (foreign_servers_include_patterns.head != NULL)
> + {
> + expand_foreign_server_name_patterns(fout, &foreign_servers_include_patterns,
> + &foreign_servers_include_oids);
> + if (foreign_servers_include_oids.head == NULL)
> + fatal("no matching foreign servers were found");
> + }
> +

I think it is better that the expand_foreign_server_name function does not
rely on a non-NULL head, so it checks that by itself and stays closer to the other
expand_* functions.
Instead I've removed the check before the function is called.

>
> + /* Skip FOREIGN TABLEs (no data to dump) if not requested explicitly */
> + if (tbinfo->relkind == RELKIND_FOREIGN_TABLE &&
> + (foreign_servers_include_oids.head == NULL ||
> + !simple_oid_list_member(&foreign_servers_include_oids,
> tbinfo->foreign_server_oid)))
> simple_oid_list_member can be split into two lines

Done

Cheers
Luis M Carril
diff --git a/doc/src/sgml/ref/pg_dump.sgml b/doc/src/sgml/ref/pg_dump.sgml
index 4bcd4bdaef..319851b936 100644
--- a/doc/src/sgml/ref/pg_dump.sgml
+++ b/doc/src/sgml/ref/pg_dump.sgml
@@ -767,6 +767,34 @@ PostgreSQL documentation
   
  
 
+ 
+  --include-foreign-data=foreignserver
+  
+   
+Dump the data for any foreign table with a foreign server
+matching foreignserver
+pattern. Multiple foreign servers can be selected by writing multiple --include-foreign-data.
+Also, the foreignserver parameter is
+interpreted as a pattern according to the same rules used by
+psql's \d commands (see ),
+so multiple foreign servers can also be selected by writing wildcard characters
+in the pattern.  When using wildcards, be careful to quote the pattern
+if needed to prevent the shell from expanding the wildcards;  see
+.
+   

Re: backup manifests

2019-09-20 Thread David Steele
On 9/20/19 10:59 AM, Robert Haas wrote:
> On Fri, Sep 20, 2019 at 9:46 AM Robert Haas  wrote:
>> - appendStringInfo et. al. I don't think it would be that hard to move
>> this to src/common, but I'm also not sure it really solves the
>> problem, because StringInfo has a 1GB limit, and there's no rule at
>> all that a backup manifest has got to be less than 1GB.
> 
> Hmm.  That's actually going to be a problem on the server side, no
> matter what we do on the client side.  We have to send the manifest
> after we send everything else, so that we know what we sent. But if we
> sent a lot of files, the manifest might be really huge. I had been
> thinking that we would generate the manifest on the server and send it
> to the client after everything else, but maybe this is an argument for
> generating the manifest on the client side and writing it
> incrementally. That would require the client to peek at the contents
> of every tar file it receives all the time, which it currently doesn't
> need to do, but it does peek inside them a little bit, so maybe it's
> OK.
> 
> Another alternative would be to have the server spill the manifest in
> progress to a temp file and then stream it from there to the client.

This seems reasonable to me.

We keep an in-memory representation which is just an array of structs
and is fairly compact -- 1 million files uses ~150MB of memory.  We just
format and stream this to storage when saving.  Saving is easier than
loading, of course.
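
For illustration, a per-file entry along these lines is enough to stay in that
ballpark; the struct and field names below are hypothetical, not taken from any
existing manifest code:

#include <stdint.h>
#include <time.h>

/*
 * Hypothetical per-file manifest entry, roughly in line with the
 * ~150 bytes/file figure above; no field name here comes from an
 * existing implementation.
 */
typedef struct ManifestFile
{
	const char *name;			/* path relative to the data directory */
	uint64_t	size;			/* file size in bytes */
	time_t		timestamp;		/* last modification time */
	char		checksum[65];	/* hex-encoded SHA-256 plus NUL */
} ManifestFile;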

-- 
-David
da...@pgmasters.net




Re: backup manifests

2019-09-20 Thread Chapman Flack
On 9/20/19 9:46 AM, Robert Haas wrote:

> least, I think you do. There's probably some way to create a callback
> structure that lets you presuppose that the toplevel data structure is
> an array (or object) and get back each element of that array (or
> key/value pair) as it's parsed,

If a JSON parser does find its way into src/common, it probably wants
to have such an incremental mode available, similar to [2] offered
in the "Jackson" library for Java.

The Jackson developer has propounded a thesis[1] that such a parsing
library ought to offer "Three -- and Only Three" different styles of
API corresponding to three ways of organizing the code using the
library ([2], [3], [4], which also resemble the different APIs
supplied in Java for XML processing).
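
For concreteness, a pull-style ("streaming") API in C could look roughly like
the sketch below; all names are made up for illustration and do not refer to an
existing PostgreSQL or Jackson API:

#include <stddef.h>

/* Tokens an incremental JSON reader could hand back one at a time. */
typedef enum
{
	JSON_TOKEN_OBJECT_START,
	JSON_TOKEN_OBJECT_END,
	JSON_TOKEN_ARRAY_START,
	JSON_TOKEN_ARRAY_END,
	JSON_TOKEN_FIELD_NAME,
	JSON_TOKEN_STRING,
	JSON_TOKEN_NUMBER,
	JSON_TOKEN_TRUE,
	JSON_TOKEN_FALSE,
	JSON_TOKEN_NULL,
	JSON_TOKEN_INCOMPLETE,		/* need more input before the next token */
	JSON_TOKEN_END
} JsonStreamToken;

typedef struct JsonStreamParser JsonStreamParser;	/* opaque state */

/* Feed the next chunk of input; the whole document never has to be in memory. */
extern void json_stream_feed(JsonStreamParser *parser, const char *buf, size_t len);

/* Return the next complete token and, for scalars and field names, its text. */
extern JsonStreamToken json_stream_next(JsonStreamParser *parser,
										const char **value, size_t *value_len);

A caller walking a huge top-level array would just loop on json_stream_next()
and handle one element at a time, which fits both the manifest-reading and
manifest-writing cases discussed upthread.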

Regards,
-Chap


[1] http://www.cowtowncoder.com/blog/archives/2009/01/entry_132.html
[2] http://www.cowtowncoder.com/blog/archives/2009/01/entry_137.html
[3] http://www.cowtowncoder.com/blog/archives/2009/01/entry_153.html
[4] http://www.cowtowncoder.com/blog/archives/2009/01/entry_152.html




Re: WAL recycled despite logical replication slot

2019-09-20 Thread Tomas Vondra

On Fri, Sep 20, 2019 at 08:45:34AM -0400, Jeff Janes wrote:

While testing something else (whether "terminating walsender process due to
replication timeout" was happening spuriously), I had logical replication
set up streaming a default pgbench transaction load, with the publisher
being 13devel-e1c8743 and subscriber being 12BETA4.  Eventually I started
getting errors about requested wal segments being already removed:

10863 sub idle 0 2019-09-19 17:14:58.140 EDT LOG:  starting logical
decoding for slot "sub"
10863 sub idle 0 2019-09-19 17:14:58.140 EDT DETAIL:  Streaming
transactions committing after 79/EB0B17A0, reading WAL from 79/E70736A0.
10863 sub idle 58P01 2019-09-19 17:14:58.140 EDT ERROR:  requested WAL
segment 0001007900E7 has already been removed
10863 sub idle 0 2019-09-19 17:14:58.144 EDT LOG:  disconnection:
session time: 0:00:00.030 user=jjanes database=jjanes host=10.0.2.2
port=40830

It had been streaming for about 50 minutes before the error showed up, and
it showed right when streaming was restarting after one of the replication
timeouts.

Is there an innocent explanation for this?  I thought logical replication
slots provided an iron-clad guarantee that WAL would be retained until it
was no longer needed.  I am just using pub/sub, none of the lower level
stuff.



I think you're right - this should not happen with replication slots.
Can you provide more detailed setup instructions, so that I can try to
reproduce and investigate the issue?

regards

--
Tomas Vondra  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services 





Re: subscriptionCheck failures on nightjar

2019-09-20 Thread Tomas Vondra

On Thu, Sep 19, 2019 at 01:23:05PM +0900, Michael Paquier wrote:

On Wed, Sep 18, 2019 at 11:58:08PM +0200, Tomas Vondra wrote:

I kinda suspect it might be just a coincidence that it fails during that
particular test. What likely plays a role here is a checkpoint timing
(AFAICS that's the thing removing the file).  On most systems the tests
complete before any checkpoint is triggered, hence no issue.

Maybe aggressively triggering checkpoints on the running cluster from
another session would do the trick ...


Now that I recall, another thing I forgot to mention on this thread is
that I patched guc.c to reduce the minimum of checkpoint_timeout to
1s.


But even with that change you haven't managed to reproduce the issue,
right? Or am I misunderstanding?

regards

--
Tomas Vondra  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services 





Re: Allow CLUSTER, VACUUM FULL and REINDEX to change tablespace on the fly

2019-09-20 Thread Alvaro Herrera
On 2019-Sep-19, Robert Haas wrote:

> So, earlier in this thread, I suggested making this part of ALTER
> TABLE, and several people seemed to like that idea. Did we have a
> reason for dropping that approach?

Hmm, my own reading of that was to add tablespace changing abilities to
ALTER TABLE *in addition* to this patch, not instead of it.

-- 
Álvaro Herrerahttps://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: Global temporary tables

2019-09-20 Thread Pavel Stehule
On Wed, 18 Sep 2019 at 12:04, Konstantin Knizhnik <k.knizh...@postgrespro.ru> wrote:

>
>
> On 21.08.2019 11:54, Konstantin Knizhnik wrote:
>
>
>
> On 20.08.2019 20:01, Pavel Stehule wrote:
>
> Another solution is wait on ZHeap storage and replica can to have own UNDO
> log.
>
>>
>> I thought about implementation of special table access method for
>> temporary tables.
>>
>
> +1
>
>
> Unfortunately implementing special table access method for temporary
> tables doesn't solve all problems.
> XID generation is not part of table access methods.
> So we still need to assign some XID to write transaction at replica which
> will not conflict with XIDs received from master.
> Actually only global temp tables can be updated at replica and so assigned
> XIDs can be stored only in tuples of such relations.
> But still I am not sure that we can use arbitrary XID for such
> transactions at replica.
>
> Also I am upset by the amount of functionality which has to be reimplemented for
> global temp tables if we really want to provide access method for them:
>
> 1. CLOG
> 2. vacuum
> 3. MVCC visibility
>
> And still it is not possible to encapsulate all changes need to support
> writes to temp tables at replica inside table access method.
> XID assignment, transaction commit and abort, subtransactions - all this
> places need to be patched.
>
>
> I was able to fully support work with global temp tables at replica
> (including subtransactions).
> The patch is attached. Also you can find this version in
> https://github.com/postgrespro/postgresql.builtin_pool/tree/global_temp_hot
>
> Right now transactions at replica updating global temp table are assigned
> special kind of GIDs which are not related with XIDs received from master.
> So special visibility rules are used for such tables at replica. Also I
> have to patch TransactionIdIsInProgress, TransactionIdDidCommit,
> TransactionIdGetCurrent
> functions to correctly handle such XIDs. In principle it is possible to
> implement global temp tables as special heap access method. But it will
> require copying a lot of code (heapam.c)
> so I prefer to add few checks to existed functions.
>
> There are still some limitations:
> - Number of transactions at replica which update temp tables is limited by
> 2^32 (wraparound problem is not addressed).
> - I have to maintain in-memory analog of CLOG for such transactions which
> is also not cropped. It means that for 2^32 transactions the size of the bitmap
> can grow up to 0.5Gb.
>
> I am trying to understand what the next steps in global temp tables
> support are.
> This is why I want to perform a short survey - what are people expecting
> from global temp tables:
>
> 1. I do not need them at all.
> 2. Eliminate catalog bloating.
> 3. Mostly needed for compatibility with Oracle (simplify porting,...).
> 4. Parallel query execution.
> 5. Can be used at replica.
> 6. More efficient use of resources (first of all memory).
>

There can be another point, important for the cloud. Inside some clouds there
are usually two types of disks - persistent (slow) and ephemeral (fast). We
used temp tables effectively there because we moved the temp tablespace to the
ephemeral disks.

I am missing one point in your list - developer comfort - using temp tables is
just much more comfortable: you don't need to create them again and again.
Thanks to this behavior it should be possible to reduce @2, and @3 can be a
nice side effect. If you reduce @2 to zero, then @5 should be possible without
anything else.

Pavel


> --
> Konstantin Knizhnik
> Postgres Professional: http://www.postgrespro.com
> The Russian Postgres Company
>
>


Re: [HACKERS] CLUSTER command progress monitor

2019-09-20 Thread Alvaro Herrera
On 2019-Sep-18, Michael Paquier wrote:

> On Tue, Sep 17, 2019 at 10:50:22PM -0300, Alvaro Herrera wrote:
> > On 2019-Sep-18, Michael Paquier wrote:
> >> So, with the clock ticking and the release getting close by, what do
> >> we do for this set of issues?  REINDEX, CREATE INDEX and CLUSTER all
> >> try to build indexes and the current infrastructure is not really
> >> adapted to hold all that.  Robert, Alvaro and Peter E, do you have any
> >> comments to offer?
> > 
> > Which part of it is not already fixed?
> 
> I can still see at least two problems.  There is one issue with
> pgstat_progress_update_param() which gets called in reindex_table() 
> for a progress phase of CLUSTER, and this even if
> REINDEXOPT_REPORT_PROGRESS is not set in the options.

(You mean reindex_relation.)

... but that param update is there for CLUSTER, not for REINDEX, so if
we made it dependent on the option to turn on CREATE INDEX progress
updates, it would break CLUSTER progress reporting.  Also, the parameter
being updated is not used by CREATE INDEX, so there's no spurious
change.  I think this ain't broke, and thus it don't need fixin'.

If anything, I would like the CLUSTER report to show index creation
progress, which would go the opposite way.  But that seems a patch for
pg13.

> Also it seems
> to me that the calls to pgstat_progress_start_command() and
> pgstat_progress_end_command() are at incorrect locations for
> reindex_index() and that those should be one level higher on the stack
> to avoid any kind of interactions with another command whose progress
> has already started.

That doesn't work, because the caller doesn't have the OID of the table,
which pgstat_progress_start_command() needs.

-- 
Álvaro Herrerahttps://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: A problem presentaion about ECPG, DECLARE STATEMENT

2019-09-20 Thread Tom Lane
"kuroda.hay...@fujitsu.com"  writes:
>> If a solid case can be made that ECPG's DECLARE STATEMENT was done
>> wrong, we'd be better off to just revert the feature out of v12
>> and try again, under less time pressure, for v13.

> I see, I'll propose this at the next commitfest.
> But I'm now considering this commit should be reverted in order to avoid 
> the confusion.

Per this discussion, I've reverted DECLARE STATEMENT out of v12 and HEAD
branches.

One thing that could use more eyeballs on it is the changes in
ecpg_register_prepared_stmt(); that was added after DECLARE STATEMENT
so there was no prior state to revert to, and I had to guess a bit.
What I guessed, seeing that the lone caller of that function is
already using stmt->connection, was that it was completely bogus
for ecpg_register_prepared_stmt() to be doing its own new connection
lookup and it should just use stmt->connection.  But I might be wrong
since I'm not too clear about where connection lookups are supposed
to be done in this library.

Another comment is that this was one of the more painful reverts
I've ever had to do.  Some of the pain was unavoidable because
there were later commits (mostly the PREPARE AS one) changing
adjacent code ... but a lot of it was due to flat-out sloppiness
in the DECLARE STATEMENT patch, particularly with respect to
whitespace.  Please run the next submission through pgindent
beforehand.  Also please pay attention to the documentation cleanups
that other people made after the initial patch.  We don't want to
have to repeat that cleanup work a second time.

regards, tom lane




Re: subscriptionCheck failures on nightjar

2019-09-20 Thread Andres Freund
Hi,

On 2019-09-19 17:20:15 +0530, Kuntal Ghosh wrote:
> It seems there is a pattern how the error is occurring in different
> systems. Following are the relevant log snippets:
> 
> nightjar:
> sub3 LOG:  received replication command: CREATE_REPLICATION_SLOT
> "sub3_16414_sync_16394" TEMPORARY LOGICAL pgoutput USE_SNAPSHOT
> sub3 LOG:  logical decoding found consistent point at 0/160B578
> sub1 PANIC:  could not open file
> "pg_logical/snapshots/0-160B578.snap": No such file or directory
> 
> dromedary scenario 1:
> sub3_16414_sync_16399 LOG:  received replication command:
> CREATE_REPLICATION_SLOT "sub3_16414_sync_16399" TEMPORARY LOGICAL
> pgoutput USE_SNAPSHOT
> sub3_16414_sync_16399 LOG:  logical decoding found consistent point at 
> 0/15EA694
> sub2 PANIC:  could not open file
> "pg_logical/snapshots/0-15EA694.snap": No such file or directory
> 
> 
> dromedary scenario 2:
> sub3_16414_sync_16399 LOG:  received replication command:
> CREATE_REPLICATION_SLOT "sub3_16414_sync_16399" TEMPORARY LOGICAL
> pgoutput USE_SNAPSHOT
> sub3_16414_sync_16399 LOG:  logical decoding found consistent point at 
> 0/15EA694
> sub1 PANIC:  could not open file
> "pg_logical/snapshots/0-15EA694.snap": No such file or directory
> 
> While subscription 3 is created, it eventually reaches to a consistent
> snapshot point and prints the WAL location corresponding to it. It
> seems sub1/sub2 immediately fails to serialize the snapshot to the
> .snap file having the same WAL location.

Since a number of people (I tried as well) have now failed to reproduce this
locally, I propose that we increase the log-level during this test on
master. And perhaps expand the set of debugging information. With the
hope that the additional information on the cases encountered on the bf
helps us build a reproducer or, even better, diagnose the issue
directly.  If people agree, I'll come up with a patch.

Greetings,

Andres Freund




Re: log bind parameter values on error

2019-09-20 Thread Andres Freund
Hi,

On 2019-09-18 17:58:53 -0300, Alvaro Herrera wrote:
> --- a/doc/src/sgml/config.sgml
> +++ b/doc/src/sgml/config.sgml
> @@ -6414,6 +6414,23 @@ log_line_prefix = '%m [%p] %q%u@%d/%a '
>
>   
>  
> +  xreflabel="log_parameters_on_error">
> +  log_parameters_on_error (boolean)
> +  
> +   log_parameters_on_error configuration 
> parameter
> +  
> +  
> +  
> +   
> +Controls whether the statement is logged with bind parameter values. 

Trailing whitespace.

I find the "the statement" formulation a bit odd, but can't quite put my
finger on why.

I think it might be worthwhile to cross-reference
log_min_error_statement, as log_parameters_on_error will only take effect when 
the
statement is logged due to log_min_error_statement.


> +It adds some overhead, as postgres will cache textual
> +representations of parameter values in memory for all statements,
> +even if they eventually do not get logged. The default is
> +off. Only superusers can change this setting.

I don't think "cache" is the right descriptor. Usually caching implies
trying to make repeated tasks faster, which isn't the case here.


> +/*
> + * The top-level portal that the client is immediately working with:
> + * creating, binding, executing, or all at one using simple protocol
> + */
> +Portal current_top_portal = NULL;
> +

This strikes me as decidedly not nice. For one this means that we'll not
be able to use this infrastructure for binds that are done serverside,
e.g. as part of plpgsql.  I'm basically inclined to think that
integrating this at the postgres.c level is the wrong thing.

It also adds new error handling complexity, which is already quite
substantial. And spreads knowledge of portals to elog.c, which imo
shouldn't have to know about them.

It seems to me that this would really need to be tracked inside the
portal infrastructure. To avoid unnecessary overhead, we could continue
to set the text values in exec_bind_message() in the pformat == 0 case,
using hasTextValues somewhere in the portal code to determine whether
the text representation has to be computed (for other input formats, and
internal queries, e.g. as generated by plpgsql).

And then the PG_CATCHes in pquery.c can add the errdetail() in the error
cases using an econtext callback.



> +/*
> + * get_portal_bind_params
> + *   Get a string containing parameters data -- used for logging.
> + *
> + * Can return NULL if there are no parameters in the portal or the portal is
> + * not valid, or the text representations of the parameters are not 
> available.
> + * If returning a non-NULL value, it allocates memory for the returned string
> + * in the current context, and it's the caller's responsibility to pfree() 
> it.
> + */
> +char *
> +get_portal_bind_params(ParamListInfo params)
> +{
> + StringInfoData param_str;
> +
> + /* No parameters to format */
> + if (!params || params->numParams == 0)
> + return NULL;
> +
> + /*
> +  * We either need textual representation of parameters pre-calculated, 
> or
> +  * call potentially user-defined I/O functions to convert the internal
> +  * representation into text, which cannot be done in an aborted xact
> +  */
> + if (!params->hasTextValues && IsAbortedTransactionBlockState())
> + return NULL;

Maybe I'm naive, but what is the point of keeping the separate
parameters, allocated separately, when all we're doing is building a
string containing them all at error time? Seems better to just directly
form the full string when deciding to keep the text parameters - for
one it'll often end up being more efficient. But more importantly it
also makes it a lot less likely to run out of memory while handling the
error. The individual text parameters can be large, and this will always
additionally need at least the combined size of all parameters from
within the error context. That's not great.


> + appendStringInfoCharMacro(&param_str, '\'');
> + for (p = pstring; *p; p++)
> + {
> + if (*p == '\'') /* double single quotes */
> + appendStringInfoCharMacro(&param_str, *p);
> + appendStringInfoCharMacro(&param_str, *p);
> + }
> + appendStringInfoCharMacro(&param_str, '\'');

I know this is old code, but: This is really inefficient. Will cause a
lot of unnecessary memory-reallocations for large text outputs, because
we don't immediately grow to at least close to the required size.
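
For what it's worth, a minimal sketch of the pre-sizing idea, assuming the same
quoting rules as the quoted loop (the function name is just for this sketch):

#include "postgres.h"
#include "lib/stringinfo.h"

static void
append_quoted_param(StringInfo param_str, const char *pstring)
{
	const char *p;

	/* Reserve the worst case up front: every byte doubled plus two quotes. */
	enlargeStringInfo(param_str, strlen(pstring) * 2 + 2);

	appendStringInfoCharMacro(param_str, '\'');
	for (p = pstring; *p; p++)
	{
		if (*p == '\'')			/* double single quotes */
			appendStringInfoCharMacro(param_str, *p);
		appendStringInfoCharMacro(param_str, *p);
	}
	appendStringInfoCharMacro(param_str, '\'');
}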


Greetings,

Andres Freund




Re: Add "password_protocol" connection parameter to libpq

2019-09-20 Thread Jeff Davis
On Fri, 2019-09-20 at 13:07 +0900, Michael Paquier wrote:
> Those are mainly nits, and attached are the changes I would do to
> your
> patch.  Please feel free to consider those changes as you see fit.
> Anyway, the base logic looks good to me, so I am switching the patch
> as ready for committer.

Thank you, applied.

* I also changed the comment above pg_fe_scram_channel_bound() to
clarify that the caller must also check that the exchange was
successful.

* I changed the error message when AUTH_REQ_OK is received without
channel binding. It seemed confusing before. I also added a test.

Regards,
Jeff Davis

From 1266d7bb6c46aa85b3c48ee271115e5ce6f4bad0 Mon Sep 17 00:00:00 2001
From: Jeff Davis 
Date: Tue, 20 Aug 2019 18:59:23 -0700
Subject: [PATCH] Add libpq parameter 'channel_binding'.

Allow clients to require channel binding to enhance security against
untrusted servers.

Author: Jeff Davis
Discussion: https://postgr.es/m/227015d8417f2b4fef03f8966dbfa5cbcc4f44da.camel%40j-davis.com
---
 doc/src/sgml/libpq.sgml   | 32 ++
 src/interfaces/libpq/fe-auth-scram.c  | 41 ++--
 src/interfaces/libpq/fe-auth.c| 76 +--
 src/interfaces/libpq/fe-auth.h|  1 +
 src/interfaces/libpq/fe-connect.c | 41 +++-
 src/interfaces/libpq/libpq-int.h  |  2 +
 src/test/authentication/t/001_password.pl | 12 +++-
 src/test/ssl/t/002_scram.pl   | 39 +++-
 src/test/ssl/t/SSLServer.pm   |  9 ++-
 9 files changed, 233 insertions(+), 20 deletions(-)

diff --git a/doc/src/sgml/libpq.sgml b/doc/src/sgml/libpq.sgml
index 1189341ca15..c58527b0c3b 100644
--- a/doc/src/sgml/libpq.sgml
+++ b/doc/src/sgml/libpq.sgml
@@ -1122,6 +1122,28 @@ postgresql://%2Fvar%2Flib%2Fpostgresql/dbname
   
  
 
+ 
+  channel_binding
+  
+  
+This option controls the client's use of channel binding. A setting
+of require means that the connection must employ
+channel binding, prefer means that the client will
+choose channel binding if available, and disable
+prevents the use of channel binding. The default
+is prefer if
+PostgreSQL is compiled with SSL support;
+otherwise the default is disable.
+  
+  
+Channel binding is a method for the server to authenticate itself to
+the client. It is only supported over SSL connections
+with PostgreSQL 11 or later servers using
+the SCRAM authentication method.
+  
+  
+ 
+
  
   connect_timeout
   
@@ -6864,6 +6886,16 @@ myEventProc(PGEventId evtId, void *evtInfo, void *passThrough)
  
 
 
+
+ 
+  
+   PGCHANNELBINDING
+  
+  PGCHANNELBINDING behaves the same as the  connection parameter.
+ 
+
+
 
  
   
diff --git a/src/interfaces/libpq/fe-auth-scram.c b/src/interfaces/libpq/fe-auth-scram.c
index 7a8335bf9f8..03f42f06030 100644
--- a/src/interfaces/libpq/fe-auth-scram.c
+++ b/src/interfaces/libpq/fe-auth-scram.c
@@ -119,6 +119,35 @@ pg_fe_scram_init(PGconn *conn,
 	return state;
 }
 
+/*
+ * Return true if channel binding was employed and the scram exchange
+ * completed. This should be used after a successful exchange to determine
+ * whether the server authenticated itself to the client.
+ *
+ * Note that the caller must also ensure that the exchange was actually
+ * successful.
+ */
+bool
+pg_fe_scram_channel_bound(void *opaq)
+{
+	fe_scram_state *state = (fe_scram_state *) opaq;
+
+	/* no SCRAM exchange done */
+	if (state == NULL)
+		return false;
+
+	/* SCRAM exchange not completed */
+	if (state->state != FE_SCRAM_FINISHED)
+		return false;
+
+	/* channel binding mechanism not used */
+	if (strcmp(state->sasl_mechanism, SCRAM_SHA_256_PLUS_NAME) != 0)
+		return false;
+
+	/* all clear! */
+	return true;
+}
+
 /*
  * Free SCRAM exchange status
  */
@@ -225,9 +254,7 @@ pg_fe_scram_exchange(void *opaq, char *input, int inputlen,
 
 			/*
 			 * Verify server signature, to make sure we're talking to the
-			 * genuine server.  XXX: A fake server could simply not require
-			 * authentication, though.  There is currently no option in libpq
-			 * to reject a connection, if SCRAM authentication did not happen.
+			 * genuine server.
 			 */
 			if (verify_server_signature(state))
 *success = true;
@@ -358,7 +385,8 @@ build_client_first_message(fe_scram_state *state)
 		appendPQExpBufferStr(&buf, "p=tls-server-end-point");
 	}
 #ifdef HAVE_PGTLS_GET_PEER_CERTIFICATE_HASH
-	else if (conn->ssl_in_use)
+	else if (conn->channel_binding[0] != 'd' && /* disable */
+			 conn->ssl_in_use)
 	{
 		/*
 		 * Client supports channel binding, but thinks the server does not.
@@ -369,7 +397,7 @@ build_client_first_message(fe_scram_state *state)
 	else
 	{
 		/*
-		 * Client does not support channel binding.
+		 * Client does not support channel binding, or has disabled it.
 		 */
 		appendPQEx

Re: allow online change primary_conninfo

2019-09-20 Thread Andres Freund
Hi,

On 2019-09-19 17:46:06 +0300, Sergei Kornilov wrote:

>   
> -  This parameter can only be set at server start.
> +  This parameter can only be set in the 
> postgresql.conf
> +  file or on the server command line.
>This setting has no effect if the server is not in standby mode.
>   
> + 
> +  If primary_conninfo is changed while WAL 
> receiver is running,
> +  the WAL receiver shuts down and then restarts with new setting,
> +  except when primary_conninfo is an empty string.
> + 

From the sentence structure it's not clear that "except when
primary_conninfo is an empty string" only applies to "and then restarts
with new setting".


> +/*
> + * Need for restart running WalReceiver due the configuration change.
> + * Suitable only for XLOG_FROM_STREAM source
> + */
> +static bool pendingWalRcvRestart = false;

s/due the/due to a/ (or even "due to the").


> @@ -11862,6 +11869,17 @@ WaitForWALToBecomeAvailable(XLogRecPtr RecPtr, bool 
> randAccess,
>   if (WalRcvStreaming())
>   ShutdownWalRcv();
>  
> + /*
> +  * If wal receiver is requested to 
> restart, we skip the
> +  * next XLOG_FROM_ARCHIVE to 
> immediately starting it.
> +  */
> + if (pendingWalRcvRestart)
> + {
> + lastSourceFailed = true;
> + currentSource = 
> XLOG_FROM_ARCHIVE;
> + continue;
> + }

I can't parse that comment. What does "skipping to starting" mean? I
assume it's just about avoiding wal_retrieve_retry_interval, but I think
the comment ought to be rephrased.

Also, should we really check this before scanning for new timelines?

Why is it the right thing to change to XLOG_FROM_ARCHIVE when we're just
restarting walreceiver? The server might unnecessarily get stuck in
archive based recovery for a long time this way? It seems to me that
we'd need to actually trigger RequestXLogStreaming() in this case.

Greetings,

Andres Freund




Re: backup manifests

2019-09-20 Thread Robert Haas
On Fri, Sep 20, 2019 at 11:09 AM David Steele  wrote:
> Seems to me we are overdue for elog()/ereport() compatible
> error-handling in the front end.  Plus mem contexts.
>
> It sucks to make that a prereq for this project but the longer we kick
> that can down the road...

There are no doubt many patches that would benefit from having more
backend infrastructure exposed in frontend contexts, and I think we're
slowly moving in that direction, but I generally do not believe in
burdening feature patches with major infrastructure improvements.
Sometimes it's necessary, as in the case of parallel query, which
required upgrading a whole lot of backend infrastructure in order to
have any chance of doing something useful. In most cases, however,
there's a way of getting the patch done that dodges the problem.

For example, I think there's a pretty good argument that Heikki's
design for relation forks was a bad one. It's proven to scale poorly
and create performance problems and extra complexity in quite a few
places. It would likely have been better, from a strictly theoretical
point of view, to insist on a design where the FSM and VM pages got
stored inside the relation itself, and the heap was responsible for
figuring out how various pages were being used. When BRIN came along,
we insisted on precisely that design, because it was clear that
further straining the relation fork system was not a good plan.
However, if we'd insisted on that when Heikki did the original work,
it might have delayed the arrival of the free space map for one or
more releases, and we got big benefits out of having that done sooner.
There's nothing stopping someone from writing a patch to get rid of
relation forks and allow a heap AM to have multiple relfilenodes (with
the extra ones used for the FSM and VM) or with multiplexing all the
data inside of a single file. Nobody has, though, because it's hard,
and the problems with the status quo are not so bad as to justify the
amount of development effort that would be required to fix it. At some
point, that problem is probably going to work its way to the top of
somebody's priority list, but it's already been about 10 years since
that all happened and everyone has so far dodged dealing with the
problem, which in turn has enabled them to work on other things that
are perhaps more important.

I think the same principle applies here. It's reasonable to ask the
author of a feature patch to fix issues that are closely related to
the feature in question, or even problems that are not new but would
be greatly exacerbated by the addition of the feature. It's not
reasonable to stack up a list of infrastructure upgrades that somebody
has to do as a condition of having a feature patch accepted that does
not necessarily require those upgrades. I am not convinced that JSON
is actually a better format for a backup manifest (more on that
below), but even if I were, I believe that getting a backup manifest
functionality into PostgreSQL 13, and perhaps incremental backup on
top of that, is valuable enough to justify making some compromises to
make that happen. And I don't mean "compromises" as in "let's commit
something that doesn't work very well;" rather, I mean making design
choices that are aimed at making the project something that is
feasible and can be completed in reasonable time, rather than not.

And saying, well, the backup manifest format *has* to be JSON because
everything else suxxor is not that. We don't have a single other
example of a file that we read and write in JSON format. Extension
control files use a custom format. Backup labels and backup history
files and timeline history files and tablespace map files use custom
formats. postgresql.conf, pg_hba.conf, and pg_ident.conf use custom
formats. postmaster.opts and postmaster.pid use custom formats. If
JSON is better and easier, at least one of the various people who
coded those things up would have chosen to use it, but none of them
did, and nobody's made a serious attempt to convert them to use it.
That might be because we lack the infrastructure for dealing with JSON
and building it is more work than anybody's willing to do, or it might
be because JSON is not actually better for these kinds of use cases,
but either way, it's hard to see why this particular patch should be
burdened with a requirement that none of the previous ones had to
satisfy.

Personally, I'd be intensely unhappy if a motion to convert
postgresql.conf or pg_hba.conf to JSON format gathered enough steam to
be adopted.  It would be darn useful, because you could specify
complex values for options instead of being limited to scalars, but it
would also make the configuration files a lot harder for human beings
to read and grep and the quality of error reporting would probably
decline significantly.  Also, appending a setting to the file,
something which is currently quite simple, would get a lot harder.
Ad-hoc file formats can be problematic, but they can also have real

Re: pgbench - allow to create partitioned tables

2019-09-20 Thread Fabien COELHO


Hello Amit,


I would not bother to create a patch for so small an improvement. This
makes sense in passing because the created function makes it very easy,
but otherwise I'll just drop it.


I would prefer to drop for now.


Attached v13 does that; I added a comment instead. I do not think that it
is an improvement.



+ else
+ {
+ fprintf(stderr, "unexpected partition method: \"%s\"\n", ps);
+ exit(1);
+ }

If we can't catch that earlier, then it might be better to have some
version-specific checks rather than such obscure code which is
difficult to understand for others.


Hmmm. The code simply checks for the current partitioning and fails if the
result is unknown, which I understood was what you asked; the previous
version was just ignoring the result.


The likelihood of postgres dropping support for range or hash partitions
seems low.


This issue would rather be raised if an older partition-enabled pgbench is run
against a newer postgres which adds a new partition method. But then I
cannot guess when a new partition method will be added, so I cannot put a
version guard on something in the future. Possibly, if no new
method is ever added, the code will never be triggered.



I have made a few modifications in the attached patch.
* move the create partitions related code into a separate function.


Why not. Not sure it is an improvement.


* make the check related to number of partitions consistent i.e check
partitions > 0 apart from where we print which I also want to change
but let us first discuss one of the above points


I switched two instances of >= 1 to > 0, which had 1 instance before.


* when we don't found pgbench_accounts table, error out instead of continuing


I do not think that it is a good idea, but I did it anyway to move
things forward.



* ensure append_fillfactor doesn't assume that it has to append
fillfactor and removed fillfactor < 100 check from it.


Done, which is too bad.


* improve the comments around query to fetch partitions


What? How?

There are already quite a few comments compared to the length of the 
query.


* improve the comments in the patch and make the code look like nearby 
code


This requirement is too fuzzy. I re-read the changes, and both code and
comments look okay to me.



* pgindent the patch


Done.


I think we should try to add some note or comment that why we only
choose to partition pgbench_accounts table when the user has given
--partitions option.


Added as a comment on the initPartition function.

--
Fabien.diff --git a/doc/src/sgml/ref/pgbench.sgml b/doc/src/sgml/ref/pgbench.sgml
index c857aa3cba..e3a0abb4c7 100644
--- a/doc/src/sgml/ref/pgbench.sgml
+++ b/doc/src/sgml/ref/pgbench.sgml
@@ -306,6 +306,31 @@ pgbench  options  d
   
  
 
+ 
+  --partitions=NUM
+  
+   
+Create a partitioned pgbench_accounts table with
+NUM partitions of nearly equal size for
+the scaled number of accounts.
+Default is 0, meaning no partitioning.
+   
+  
+ 
+
+ 
+  --partition-method=NAME
+  
+   
+Create a partitioned pgbench_accounts table with
+NAME method.
+Expected values are range or hash.
+This option requires that --partitions is set to non-zero.
+If unspecified, default is range.
+   
+  
+ 
+
  
   --tablespace=tablespace
   
diff --git a/src/bin/pgbench/pgbench.c b/src/bin/pgbench/pgbench.c
index ed7652bfbf..10eadd8e96 100644
--- a/src/bin/pgbench/pgbench.c
+++ b/src/bin/pgbench/pgbench.c
@@ -186,6 +186,19 @@ int64		latency_limit = 0;
 char	   *tablespace = NULL;
 char	   *index_tablespace = NULL;
 
+/* partitioning for pgbench_accounts table, 0 for no partitioning */
+static int	partitions = 0;
+
+typedef enum
+{
+	PART_NONE, PART_RANGE, PART_HASH
+}
+
+			partition_method_t;
+
+static partition_method_t partition_method = PART_NONE;
+static const char *PARTITION_METHOD[] = {"none", "range", "hash"};
+
 /* random seed used to initialize base_random_sequence */
 int64		random_seed = -1;
 
@@ -617,6 +630,9 @@ usage(void)
 		   "  --foreign-keys   create foreign key constraints between tables\n"
 		   "  --index-tablespace=TABLESPACE\n"
 		   "   create indexes in the specified tablespace\n"
+		   "  --partitions=NUM partition pgbench_accounts in NUM parts (default: 0)\n"
+		   "  --partition-method=(range|hash)\n"
+		   "   partition pgbench_accounts with this method (default: range)\n"
 		   "  --tablespace=TABLESPACE  create tables in the specified tablespace\n"
 		   "  --unlogged-tablescreate tables as unlogged tables\n"
 		   "\nOptions to select what to run:\n"
@@ -3601,6 +3617,80 @@ initDropTables(PGconn *con)
 	 "pgbench_tellers");
 }
 
+/*
+ * add fillfactor percent option.
+ */
+static void
+append_fillfactor(char *opts, int len)
+{
+	/* as default is 100, it could be removed in this case */
+	sn

Re: Efficient output for integer types

2019-09-20 Thread David Fetter
On Wed, Sep 18, 2019 at 04:27:46PM +0900, Kyotaro Horiguchi wrote:
> Hello.
> 
> At Wed, 18 Sep 2019 05:42:01 +0200, David Fetter  wrote in 
> <20190918034201.gx31...@fetter.org>
> > On Tue, Sep 17, 2019 at 09:01:57AM +0200, David Fetter wrote:
> > > On Tue, Sep 17, 2019 at 08:55:05AM +0200, David Fetter wrote:
> > > > On Sun, Sep 15, 2019 at 09:18:49AM +0200, David Fetter wrote:
> > > > > Folks,
> > > > > 
> > > > > Please find attached a couple of patches intended to $subject.
> > > > > 
> > > > > This patch set cut the time to copy ten million rows of randomly sized
> > > > > int8s (10 of them) by about a third, so at least for that case, it's
> > > > > pretty decent.
> > > > 
> > > > Added int4 output, removed the sprintf stuff, as it didn't seem to
> > > > help in any cases I was testing.
> > > 
> > > Found a couple of "whiles" that should have been "ifs."
> > 
> > Factored out some inefficient functions and made the guts use the more
> > efficient function.
> 
> I'm not sure this is in keeping with the KISS principle, but I looked at it and
> have several random comments.
> 
> +numutils.o: CFLAGS += $(PERMIT_DECLARATION_AFTER_STATEMENT)
> 
> I don't think that we are allowing that as project coding
> policy. It seems to have been introduced only to accept external
> code as-is.

Changed to fit current policy.

> -charstr[23];/* sign, 21 digits and '\0' */
> +charstr[MAXINT8LEN];
> 
> It's uneasy that MAXINT8LEN includes the trailing NUL; a name like MAXINT8BUFLEN
> could be defined that way. I think MAXINT8LEN should be 20 and the definition
> should be str[MAXINT8LEN + 1].

Done.
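
So the convention ends up as something like this (just restating the agreed
sizing, not quoting the patch):

/* Longest int64 text, "-9223372036854775808", excluding the trailing NUL. */
#define MAXINT8LEN	20

char		str[MAXINT8LEN + 1];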

> +static const char DIGIT_TABLE[200] = {
> + '0', '0', '0', '1', '0', '2', '0', '3', '0', '4', '0', '5', '0', '6', 
> '0', '7', '0', '8', '0', '9',
> 
> Wouldn't it be simpler if it were defined as a constant string?
> 
> static const char DIGIT_TABLE[201] =
> "00010203040519"
> "20212223242539"
> ..

I thought this might be even clearer:

"00" "01" "02" "03" "04" "05" "06" "07" "08" "09"
"10" "11" "12" "13" "14" "15" "16" "17" "18" "19"
"20" "21" "22" "23" "24" "25" "26" "27" "28" "29"
"30" "31" "32" "33" "34" "35" "36" "37" "38" "39"
"40" "41" "42" "43" "44" "45" "46" "47" "48" "49"
"50" "51" "52" "53" "54" "55" "56" "57" "58" "59"
"60" "61" "62" "63" "64" "65" "66" "67" "68" "69"
"70" "71" "72" "73" "74" "75" "76" "77" "78" "79"
"80" "81" "82" "83" "84" "85" "86" "87" "88" "89"
"90" "91" "92" "93" "94" "95" "96" "97" "98" "99";

> +pg_ltoa_n(int32 value, char *a)
> ...
> + /* Compute the result string. */
> + while (value >= 1)
> 
> We have only two digits above the value. Isn't the stuff inside
> the while a waste of cycles?

Changed the while to an if.

> + /* Expensive 64-bit division. Optimize? */
> 
> I believe the compiler handles such trivial optimizations (concretely,
> it converts them into shifts and subtractions if that is faster).

Comments removed.

> + memcpy(a + olength - i - 2, DIGIT_TABLE + c0, 2);
> 
> Maybe it'd be easy to read if 'a + olength - i' is a single variable.

Done.

> + i += adjust;
> + return i;
> 
> If 'a + olength - i' is replaced with a variable, the return
> statement is replacable with "return olength + adjust;".

I'm not sure I understand this.

> + return t + (v >= PowersOfTen[t]);
> 
> I think it would be better if it were 't - (v < POT[t]) + 1; /*
> log10(v) + 1 */'. At least we need an explanation of the
> difference.  (I didn't check whether the algorithm is truly right,
> though.)

Comments added.

> > void
> > pg_lltoa(int64 value, char *a)
> > {
> ..
> > memcpy(a, "-9223372036854775808", 21);
> ..
> > memcpy(a, "0", 2);
> 
> The lines need a comment like "/* length contains trailing '\0'
> */"

Comments added.

> + if (value >= 0)
> ...
> + else
> + {
> + if (value == PG_INT32_MIN)
> + memcpy(str, "-2147483648", 11);
> + return str + 11;
> > }
> + *str++ = '-';
> + return pg_ltostr_zeropad(str, -value, minwidth - 1);
> 
> If the then block of the if statement were the (values < 0) case, we wouldn't
> need to reenter the function.

This is a tail-call recursion, so it's probably optimized already.

> + len = pg_ltoa_n(value, str);
> + if (minwidth <= len)
> + return str + len;
> +
> + memmove(str + minwidth - len, str, len);
> 
> If the function had the parameters str with the room only for two
> digits and a NUL, 2 as minwidth but 1000 as value, the function
> would overrun the buffer. The original function just ignores
> overflowing digits.

I believe the original was incorrect.

Best,
David.
-- 
David Fetter  http://fetter.org/
Phone: +1 415 235 3778

Remember to vote!
Consider donating to Postgres: http://www.postgresql.org/about/donate
From 30d8a2b1bd2f6094a5cb4dd8acb9cc36c837802a Mon Sep 17 00:00:00 2001
From: David Fetter 
Date: Sun, 15 Sep 2019 00:

Re: log bind parameter values on error

2019-09-20 Thread Alvaro Herrera
Hi, thanks for looking.

On 2019-Sep-20, Andres Freund wrote:

> On 2019-09-18 17:58:53 -0300, Alvaro Herrera wrote:

> > +  > xreflabel="log_parameters_on_error">
> > +  log_parameters_on_error 
> > (boolean)
> > +  
> > +   log_parameters_on_error configuration 
> > parameter
> > +  
> > +  
> > +  
> > +   
> > +Controls whether the statement is logged with bind parameter 
> > values. 
> 
> Trailing whitespace.
> 
> I find the "the statement" formulation a bit odd, but can't quite put my
> finger on why.

Yeah, I think that wording is pretty confusing.  I would use "Controls
whether bind parameters are logged when a statement is logged."

> > +/*
> > + * The top-level portal that the client is immediately working with:
> > + * creating, binding, executing, or all at one using simple protocol
> > + */
> > +Portal current_top_portal = NULL;
> > +
> 
> This strikes me as decidedly not nice. For one this means that we'll not
> be able to use this infrastructure for binds that are done serverside,
> e.g. as part of plpgsql.  I'm basically inclined to think that
> integrating this at the postgres.c level is the wrong thing.
[...]
> It seems to me that this would really need to be tracked inside the
> portal infrastructure.

I think that's how this was done at first, then Peter E drove him away
from that into the current design.

> It also adds new error handling complexity, which is already quite
> substantial. And spreads knowledge of portals to elog.c, which imo
> shouldn't have to know about them.

Makes sense.

> > +   appendStringInfoCharMacro(&param_str, '\'');
> > +   for (p = pstring; *p; p++)
> > +   {
> > +   if (*p == '\'') /* double single quotes */
> > +   appendStringInfoCharMacro(&param_str, *p);
> > +   appendStringInfoCharMacro(&param_str, *p);
> > +   }
> > +   appendStringInfoCharMacro(&param_str, '\'');
> 
> I know this is old code, but: This is really inefficient. Will cause a
> lot of unnecessary memory-reallocations for large text outputs, because
> we don't immediately grow to at least close to the required size.

Agreed, but we can't blame a patch because it moves around some old
crufty code.  If Alexey wants to include another patch to change this to
a better formulation, I'm happy to push that before or after his main
patch.  And if he doesn't want to, that's fine with me too.

(Is doing a strlen first to enlarge the stringinfo an overall better
approach?)  (I wonder if it would make sense to strchr each ' and memcpy
the intervening bytes ... if only that didn't break the SI abstraction
completely ...)

-- 
Álvaro Herrerahttps://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: Write visibility map during CLUSTER/VACUUM FULL

2019-09-20 Thread Andres Freund
Hi,

On 2019-09-13 22:22:50 +0300, Alexander Korotkov wrote:
> On Thu, Sep 12, 2019 at 4:55 PM Alexander Korotkov
>  wrote:
> > On Wed, Sep 11, 2019 at 3:30 PM Amit Kapila  wrote:
> > > On Sun, Sep 1, 2019 at 1:37 PM Alexander Korotkov
> > >  wrote:
> > > > I found it weird that CLUSTER/VACUUM FULL don't write visibility map.
> > > > Attached patch implements writing visibility map in
> > > > heapam_relation_copy_for_cluster().
> > > >
> > > > I've studied previous attempt to implement this [1].  The main problem
> > > > of that attempt was usage of existing heap_page_is_all_visible() and
> > > > visibilitymap_set() functions.  These functions works through buffer
> > > > manager, while heap rewriting is made bypass buffer manager.
> > > >
> > > > In my patch visibility map pages are handled in the same way as heap
> > > > pages are.

Why do you want to do that? To me that doesn't immediately seem like a
good idea, and I've not seen justification for it in this thread. Did I
miss something?


> > > I haven't studied this patch in detail, but while glancing I observed
> > > that this doesn't try to sync the vm pages as we do for heap pages in
> > > the end (during end_heap_rewrite).  Am I missing something?
> >
> > You're not missed anything.  Yes, VM need sync.  Will fix this.  And I
> > just noticed I need a closer look to what is going on with TOAST.
> 
> Attached patch syncs VM during end_heap_rewrite().
> 
> However, VM for TOAST still isn't read.  It appear to be much more
> difficult to write VM for TOAST, because it's written by insertion
> tuples one-by-one.  Despite it seems to fill TOAST heap pages
> sequentially (assuming no FSM exists yet), it's quite hard to handle
> page-switching event with reasonable level of abstraction.
> Nevertheless, I find this patch useful in current shape.  Even if we
> don't write VM for TOAST, it's still useful to do for regular heap.
> Additionally, one of key advantages of having VM is index-only scan,
> which don't work for TOAST anyway.

Have you looked at the discussion around
https://www.postgresql.org/message-id/20190408010427.4l63qr7h2fjcyp77%40alap3.anarazel.de
?

That's not quite the same thing as you're tackling here, but I wonder if
some of the logic could be the same. Especially for toast.


> +/* Write contents of VM page */
> +static void
> +rewrite_flush_vm_page(RewriteState state)
> +{
> + Assert(state->rs_vm_buffer_valid);
> +
> + if (state->rs_use_wal)
> + log_newpage(&state->rs_new_rel->rd_node,
> + VISIBILITYMAP_FORKNUM,
> + state->rs_vm_blockno,
> + state->rs_vm_buffer,
> + true);
> + RelationOpenSmgr(state->rs_new_rel);
> +
> + PageSetChecksumInplace(state->rs_vm_buffer, state->rs_vm_blockno);
> +
> + smgrextend(state->rs_new_rel->rd_smgr, VISIBILITYMAP_FORKNUM,
> +state->rs_vm_blockno, (char *) state->rs_vm_buffer, 
> true);
> +
> + state->rs_vm_buffer_valid = false;
> +}
> +
> +/* Set VM flags to the VM page */
> +static void
> +rewrite_set_vm_flags(RewriteState state)
> +{
> + BlockNumber mapBlock = HEAPBLK_TO_MAPBLOCK(state->rs_blockno);
> + uint32  mapByte = HEAPBLK_TO_MAPBYTE(state->rs_blockno);
> + uint8   mapOffset = HEAPBLK_TO_OFFSET(state->rs_blockno);
> + char   *map;
> + uint8   flags;
> +
> + if (mapBlock != state->rs_vm_blockno && state->rs_vm_buffer_valid)
> + rewrite_flush_vm_page(state);
> +
> + if (!state->rs_vm_buffer_valid)
> + {
> + PageInit(state->rs_vm_buffer, BLCKSZ, 0);
> + state->rs_vm_blockno = mapBlock;
> + state->rs_vm_buffer_valid = true;
> + }
> +
> + flags = (state->rs_all_visible ? VISIBILITYMAP_ALL_VISIBLE : 0) |
> + (state->rs_all_frozen ? VISIBILITYMAP_ALL_FROZEN : 0);
> +
> + map = PageGetContents(state->rs_vm_buffer);
> + map[mapByte] |= (flags << mapOffset);
> +}

I think it's a bad idea to copy this many VM implementation details into
rewriteheap.c.


> +/*
> + * Update rs_all_visible and rs_all_frozen flags according to the tuple.  We
> + * use simplified check assuming that HeapTupleSatisfiesVacuum() should 
> already
> + * set tuple hint bits.
> + */
> +static void
> +rewrite_update_vm_flags(RewriteState state, HeapTuple tuple)
> +{
> + TransactionId   xmin;
> +
> + if (!state->rs_all_visible)
> + return;
> +
> + if (!HeapTupleHeaderXminCommitted(tuple->t_data))
> + {
> + state->rs_all_visible = false;
> + state->rs_all_frozen = false;
> + return;
> + }
> +
> + xmin = HeapTupleHeaderGetXmin(tuple->t_data);
> + if (!TransactionIdPrecedes(xmin, state->rs_oldest_xmin))
> + {
> + state->rs_all_visible = false;
> + state->rs_all_frozen = false;
> + 

Re: Unwanted expression simplification in PG12b2

2019-09-20 Thread Robert Haas
On Wed, Jul 17, 2019 at 5:20 PM Darafei "Komяpa" Praliaskouski
 wrote:
> Indeed, it seems I failed to minimize my example.
>
> Here is the actual one, on 90GB table with 16M rows:
> https://gist.github.com/Komzpa/8d5b9008ad60f9ccc62423c256e78b4c
>
> I can share the table on request if needed, but hope that plan may be enough.

[ replying to an old thread ]

I think that this boils down to a lack of planner smarts about target
lists. The planner currently assumes that any given relation - which
for planner purposes might be an actual table or might be the result
of joining multiple tables, aggregating something, running a subquery,
etc. - more or less has one thing that it's supposed to produce. It
only tries to generate plans that produce that target list. There's
some support in there for the idea that there might be various paths
for the same relation that produce different answers, but I don't know
of that actually being used anywhere (but it might be).

What I taught the planner to do here had to do with making the costing
more accurate for cases like this. It now figures out that if it's
going to stick a Gather in at that point, computing the expressions
below the Gather rather than above the Gather makes a difference to
the cost of that plan vs. other plans. However, it still doesn't
consider any more paths than it did before; it just costs them more
accurately. In your first example, I believe that the planner should
be able to consider both GroupAggregate -> Gather Merge -> Sort ->
Parallel Seq Scan and GroupAggregate -> Sort -> Gather -> Parallel Seq
Scan, but I think it's got a fixed idea about which fields should be
fed into the Sort. In particular, I believe it thinks that sorting
more data is so undesirable that it doesn't want to carry any
unnecessary baggage through the Sort for any reason. To solve this
problem, I think it would need to cost the second plan with projection
done both before the Sort and after the Sort and decide which one was
cheaper.

This class of problem is somewhat annoying in that the extra planner
cycles and complexity to deal with getting this right would be useless
for many queries, but at the same time, there are a few cases where it
can win big. I don't know what to do about that.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: subscriptionCheck failures on nightjar

2019-09-20 Thread Tom Lane
Andres Freund  writes:
> Since now a number of people (I tried as well), failed to reproduce this
> locally, I propose that we increase the log-level during this test on
> master. And perhaps expand the set of debugging information. With the
> hope that the additional information on the cases encountered on the bf
> helps us build a reproducer or, even better, diagnose the issue
> directly.  If people agree, I'll come up with a patch.

I recreated my freebsd-9-under-qemu setup and I can still reproduce
the problem, though not with high reliability (order of 1 time in 10).
Anything particular you want logged?

regards, tom lane




Re: log bind parameter values on error

2019-09-20 Thread Andres Freund
Hi,

On 2019-09-20 16:56:57 -0300, Alvaro Herrera wrote:
> > > +/*
> > > + * The top-level portal that the client is immediately working with:
> > > + * creating, binding, executing, or all at one using simple protocol
> > > + */
> > > +Portal current_top_portal = NULL;
> > > +
> > 
> > This strikes me as decidedly not nice. For one this means that we'll not
> > be able to use this infrastructure for binds that are done serverside,
> > e.g. as part of plpgsql.  I'm basically inclined to think that
> > integrating this at the postgres.c level is the wrong thing.
> [...]
> > It seems to me that this would really need to be tracked inside the
> > portal infrastructure.
> 
> I think that's how this was done at first, then Peter E drove him away
> from that into the current design.

I don't think it really was the way I am suggesting. There were a bunch
of helper functions managing current_top_portal, but it otherwise is
(and afaict was in all versions) still all postgres.c
controlled. Whereas I think it should be handled nearly exclusively
by pquery.c, with the possible exception of an efficiency hack to reuse
client input strings when they're in text format.

What I'm suggesting is that PortalStart() would build a string
representation out of the parameter list (using
ParamExternData.textValue if set, calling the output function
otherwise), and stash that in the portal.

At the start of all the already existing PG_TRY blocks in pquery.c we
push an element onto the error_context_stack that adds the errdetail
with the parameters to the error.  Alternatively we could also add them
in the PG_CATCH blocks, but that'd only work for elevel == ERROR
(i.e. neither FATAL nor non throwing log levels would be able to enrich
the error).
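
Just to make the shape concrete, here is a minimal sketch of what that
could look like (illustration only, not a patch: "portalParams" is a
hypothetical field holding the string built in PortalStart, and I show
errcontext() here even though the idea above is to attach the text as
errdetail):

static void
portal_params_callback(void *arg)
{
	Portal		portal = (Portal) arg;

	if (portal->portalParams != NULL)
		errcontext("portal parameters: %s", portal->portalParams);
}

	/* ... around each existing PG_TRY block in pquery.c ... */
	ErrorContextCallback errcallback;

	errcallback.callback = portal_params_callback;
	errcallback.arg = portal;
	errcallback.previous = error_context_stack;
	error_context_stack = &errcallback;

	/* existing PG_TRY() / PG_CATCH() processing of the portal */

	error_context_stack = errcallback.previous;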


Thinking about this: I think the current approach doesn't actually
handle recursive errors correctly. Even if we fail to emit the error
message due to the parameter details requiring a lot of memory, we'd
again do so (and again fail) when handling that OOM error, because
current_top_portal afaict would still be set.  At the very least this'd
need to integrate with the recursion_depth logic in elog.c.


> > > + appendStringInfoCharMacro(&param_str, '\'');
> > > + for (p = pstring; *p; p++)
> > > + {
> > > + if (*p == '\'') /* double single quotes */
> > > + appendStringInfoCharMacro(&param_str, *p);
> > > + appendStringInfoCharMacro(&param_str, *p);
> > > + }
> > > + appendStringInfoCharMacro(&param_str, '\'');
> > 
> > I know this is old code, but: This is really inefficient. Will cause a
> > lot of unnecessary memory-reallocations for large text outputs, because
> > we don't immediately grow to at least close to the required size.
> 
> Agreed, but we can't blame a patch because it moves around some old
> crufty code.  If Alexey wants to include another patch to change this to
> a better formulation, I'm happy to push that before or after his main
> patch.  And if he doesn't want to, that's fine with me too.

Well, this patch makes it potentially a considerably hotter path, so I
think there's some justification for pushing a bit. But I'd not require
it either.

As I said, I think we cannot generate this string at error time, because
it makes it much much more likely to exhaust the error context - a bad
thing.


> (Is doing a strlen first to enlarge the stringinfo an overall better
> approach?)

Yes, it'd be better.


> (I wonder if it would make sense to strchr each ' and memcpy the
> intervening bytes ... if only that didn't break the SI abstraction
> completely ...)

I'd probably just count the ' in one pass, enlarge the stringinfo to the
required size, and then put the string directly into the string buffer.
Possibly just putting the necessary code into stringinfo.c. We already
have multiple copies of this inefficient logic...

But even if not, I don't think writing data into the stringbuf directly
is that ugly, we do that in enough places that I'd argue that that's
basically part of how it's expected to be used.
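
Roughly this kind of thing (just a sketch of the idea, reusing the
param_str/pstring names from the quoted code, not a proposed patch):

	const char *p;
	int			nquotes = 0;

	/* count the quotes first so the buffer is grown exactly once */
	for (p = pstring; *p; p++)
		if (*p == '\'')
			nquotes++;

	enlargeStringInfo(&param_str, strlen(pstring) + nquotes + 2);

	/* then write directly into the buffer, doubling single quotes */
	param_str.data[param_str.len++] = '\'';
	for (p = pstring; *p; p++)
	{
		if (*p == '\'')
			param_str.data[param_str.len++] = '\'';
		param_str.data[param_str.len++] = *p;
	}
	param_str.data[param_str.len++] = '\'';
	param_str.data[param_str.len] = '\0';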

In HEAD there's at least
- postgres.c:errdetail_params(),
- pl_exec.c:format_preparedparamsdata()
  pl_exec.c:format_expr_params()
and
- dblink.c:escape_param_str()
- deparse.c:deparseStringLiteral()
- xlog.c:do_pg_start_backup() (after "Add the escape character"),
- tsearchcmds.c:serialize_deflist()
- ruleutils.c:simple_quote_literal()

are nearly the same.

Greetings,

Andres Freund




Wrong results using initcap() with non normalized string

2019-09-20 Thread Juan José Santamaría Flecha
Hello,

I have come around a strange situation when using a unicode string
that has non normalized characters. The attached script 'initcap.sql'
can reproduce the problem.

The attached patch can fix the issue.

Regards,

Juan José Santamaría Flecha


initcap.sql
Description: application/sql
diff --git a/src/backend/utils/adt/formatting.c b/src/backend/utils/adt/formatting.c
index 755ca6e..9f8becf 100644
--- a/src/backend/utils/adt/formatting.c
+++ b/src/backend/utils/adt/formatting.c
@@ -96,6 +96,7 @@
 #include "utils/memutils.h"
 #include "utils/numeric.h"
 #include "utils/pg_locale.h"
+#include "common/unicode_norm.h"
 
 /* --
  * Routines type
@@ -1864,7 +1865,8 @@ str_initcap(const char *buff, size_t nbytes, Oid collid)
 			workspace[curr_char] = towlower_l(workspace[curr_char], mylocale->info.lt);
 		else
 			workspace[curr_char] = towupper_l(workspace[curr_char], mylocale->info.lt);
-		wasalnum = iswalnum_l(workspace[curr_char], mylocale->info.lt);
+		if (!is_pg_wchar_combining(workspace[curr_char]))
+			wasalnum = iswalnum_l(workspace[curr_char], mylocale->info.lt);
 	}
 	else
 #endif
@@ -1873,7 +1875,8 @@ str_initcap(const char *buff, size_t nbytes, Oid collid)
 			workspace[curr_char] = towlower(workspace[curr_char]);
 		else
 			workspace[curr_char] = towupper(workspace[curr_char]);
-		wasalnum = iswalnum(workspace[curr_char]);
+		if (!is_pg_wchar_combining(workspace[curr_char]))
+			wasalnum = iswalnum(workspace[curr_char]);
 	}
 }
 
diff --git a/src/common/unicode_norm.c b/src/common/unicode_norm.c
index 89c5533..25b149b 100644
--- a/src/common/unicode_norm.c
+++ b/src/common/unicode_norm.c
@@ -435,3 +435,14 @@ unicode_normalize_kc(const pg_wchar *input)
 
 	return recomp_chars;
 }
+
+bool
+is_pg_wchar_combining(const pg_wchar current)
+{
+	pg_unicode_decomposition *currEntry = get_code_entry(current);
+	if (currEntry == NULL)
+		return false;
+	if (currEntry->comb_class == 0x0)
+		return false;
+	return true;
+}
diff --git a/src/include/common/unicode_norm.h b/src/include/common/unicode_norm.h
index 99167d2..bdcf02e 100644
--- a/src/include/common/unicode_norm.h
+++ b/src/include/common/unicode_norm.h
@@ -17,5 +17,6 @@
 #include "mb/pg_wchar.h"
 
 extern pg_wchar *unicode_normalize_kc(const pg_wchar *input);
+extern bool is_pg_wchar_combining(const pg_wchar current);
 
 #endif			/* UNICODE_NORM_H */


Re: Efficient output for integer types

2019-09-20 Thread David Fetter
On Fri, Sep 20, 2019 at 09:14:51PM +0200, David Fetter wrote:
> On Wed, Sep 18, 2019 at 04:27:46PM +0900, Kyotaro Horiguchi wrote:
> > Hello.
> > 
> > At Wed, 18 Sep 2019 05:42:01 +0200, David Fetter  wrote 
> > in <20190918034201.gx31...@fetter.org>
> > > On Tue, Sep 17, 2019 at 09:01:57AM +0200, David Fetter wrote:
> > > > On Tue, Sep 17, 2019 at 08:55:05AM +0200, David Fetter wrote:
> > > > > On Sun, Sep 15, 2019 at 09:18:49AM +0200, David Fetter wrote:
> > > > > > Folks,
> > > > > > 
> > > > > > Please find attached a couple of patches intended to $subject.
> > > > > > 
> > > > > > This patch set cut the time to copy ten million rows of randomly 
> > > > > > sized
> > > > > > int8s (10 of them) by about a third, so at least for that case, it's
> > > > > > pretty decent.
> > > > > 
> > > > > Added int4 output, removed the sprintf stuff, as it didn't seem to
> > > > > help in any cases I was testing.
> > > > 
> > > > Found a couple of "whiles" that should have been "ifs."
> > > 
> > > Factored out some inefficient functions and made the guts use the more
> > > efficient function.
> > 
> > I'm not sure this is on the KISS principle, but looked it and
> > have several random comments.
> > 
> > +numutils.o: CFLAGS += $(PERMIT_DECLARATION_AFTER_STATEMENT)

Oops.  Missed a few.

Best,
David.
-- 
David Fetter  http://fetter.org/
Phone: +1 415 235 3778

Remember to vote!
Consider donating to Postgres: http://www.postgresql.org/about/donate
>From f7f9cb42ce9d50e5a4746fe80208960e29ffd348 Mon Sep 17 00:00:00 2001
From: David Fetter 
Date: Sun, 15 Sep 2019 00:06:29 -0700
Subject: [PATCH v8] Make int4 and int8 operations more efficent
To: hackers
MIME-Version: 1.0
Content-Type: multipart/mixed; boundary="2.21.0"

This is a multi-part message in MIME format.
--2.21.0
Content-Type: text/plain; charset=UTF-8; format=fixed
Content-Transfer-Encoding: 8bit


- Output routines now do more digits per iteration, and
- Code determines the number of decimal digits in int4/int8 efficiently
- Split off pg_ltoa_n from pg_ltoa
- Use same to make other functions shorter

diff --git a/src/backend/access/common/printsimple.c b/src/backend/access/common/printsimple.c
index 651ade14dd..5c5b6d33b2 100644
--- a/src/backend/access/common/printsimple.c
+++ b/src/backend/access/common/printsimple.c
@@ -112,7 +112,7 @@ printsimple(TupleTableSlot *slot, DestReceiver *self)
 			case INT8OID:
 {
 	int64		num = DatumGetInt64(value);
-	char		str[23];	/* sign, 21 digits and '\0' */
+	char		str[MAXINT8LEN + 1];
 
 	pg_lltoa(num, str);
 	pq_sendcountedtext(&buf, str, strlen(str), false);
diff --git a/src/backend/utils/adt/int8.c b/src/backend/utils/adt/int8.c
index 0ff9394a2f..6230807906 100644
--- a/src/backend/utils/adt/int8.c
+++ b/src/backend/utils/adt/int8.c
@@ -27,8 +27,6 @@
 #include "utils/builtins.h"
 
 
-#define MAXINT8LEN		25
-
 typedef struct
 {
 	int64		current;
diff --git a/src/backend/utils/adt/numutils.c b/src/backend/utils/adt/numutils.c
index 70138feb29..fef6da672b 100644
--- a/src/backend/utils/adt/numutils.c
+++ b/src/backend/utils/adt/numutils.c
@@ -20,6 +20,64 @@
 
 #include "common/int.h"
 #include "utils/builtins.h"
+#include "port/pg_bitutils.h"
+
+/*
+ * A table of all two-digit numbers. This is used to speed up decimal digit
+ * generation by copying pairs of digits into the final output.
+ */
+static const char DIGIT_TABLE[200] =
+"00" "01" "02" "03" "04" "05" "06" "07" "08" "09"
+"10" "11" "12" "13" "14" "15" "16" "17" "18" "19"
+"20" "21" "22" "23" "24" "25" "26" "27" "28" "29"
+"30" "31" "32" "33" "34" "35" "36" "37" "38" "39"
+"40" "41" "42" "43" "44" "45" "46" "47" "48" "49"
+"50" "51" "52" "53" "54" "55" "56" "57" "58" "59"
+"60" "61" "62" "63" "64" "65" "66" "67" "68" "69"
+"70" "71" "72" "73" "74" "75" "76" "77" "78" "79"
+"80" "81" "82" "83" "84" "85" "86" "87" "88" "89"
+"90" "91" "92" "93" "94" "95" "96" "97" "98" "99";
+
+/*
+ * Adapted from http://graphics.stanford.edu/~seander/bithacks.html#IntegerLog10
+ */
+static inline uint32
+decimalLength32(const uint32 v)
+{
+	uint32			t;
+	static uint64	PowersOfTen[] =
+	{1,10,100,
+	 1000, 10000, 100000,
+	 1000000,  10000000,  100000000,
+	 1000000000};
+	/*
+	 * Compute base-10 logarithm by dividing the base-2 logarithm
+	 * by a good-enough approximation of the base-2 logarithm of 10
+	 */
+	t = (pg_leftmost_one_pos32(v) + 1)*1233/4096;
+	return t + (v >= PowersOfTen[t]);
+}
+
+static inline uint32
+decimalLength64(const uint64 v)
+{
+	uint32			t;
+	static uint64	PowersOfTen[] =
+	{1,10,100,
+	 1000, 10000, 100000,
+	 1000000,  10000000,  100000000,
+	 1000000000,   10000000000,   100000000000,
+	 1000000000000,10000000000000,100000000000000,
+	 1000000000000000, 10000000000000000, 100000000000000000,
+	 1000000000000000000};
+
+	/*
+	 * Compute base-10 logar

Re: Efficient output for integer types

2019-09-20 Thread David Fetter
On Fri, Sep 20, 2019 at 11:09:16PM +0200, David Fetter wrote:
> On Fri, Sep 20, 2019 at 09:14:51PM +0200, David Fetter wrote:
> > On Wed, Sep 18, 2019 at 04:27:46PM +0900, Kyotaro Horiguchi wrote:
> > > Hello.
> > > 
> > > At Wed, 18 Sep 2019 05:42:01 +0200, David Fetter  wrote 
> > > in <20190918034201.gx31...@fetter.org>
> > > > On Tue, Sep 17, 2019 at 09:01:57AM +0200, David Fetter wrote:
> > > > > On Tue, Sep 17, 2019 at 08:55:05AM +0200, David Fetter wrote:
> > > > > > On Sun, Sep 15, 2019 at 09:18:49AM +0200, David Fetter wrote:
> > > > > > > Folks,
> > > > > > > 
> > > > > > > Please find attached a couple of patches intended to $subject.
> > > > > > > 
> > > > > > > This patch set cut the time to copy ten million rows of randomly 
> > > > > > > sized
> > > > > > > int8s (10 of them) by about a third, so at least for that case, 
> > > > > > > it's
> > > > > > > pretty decent.
> > > > > > 
> > > > > > Added int4 output, removed the sprintf stuff, as it didn't seem to
> > > > > > help in any cases I was testing.
> > > > > 
> > > > > Found a couple of "whiles" that should have been "ifs."
> > > > 
> > > > Factored out some inefficient functions and made the guts use the more
> > > > efficient function.
> > > 
> > > I'm not sure this is on the KISS principle, but looked it and
> > > have several random comments.
> > > 
> > > +numutils.o: CFLAGS += $(PERMIT_DECLARATION_AFTER_STATEMENT)
> 
> Oops.  Missed a few.

D'oh!  Wrong patch.

Best,
David.
-- 
David Fetter  http://fetter.org/
Phone: +1 415 235 3778

Remember to vote!
Consider donating to Postgres: http://www.postgresql.org/about/donate
>From 51d793143daf44cc0f01a00d6aedeaf3fdd88230 Mon Sep 17 00:00:00 2001
From: David Fetter 
Date: Sun, 15 Sep 2019 00:06:29 -0700
Subject: [PATCH v9] Make int4 and int8 operations more efficent
To: hackers
MIME-Version: 1.0
Content-Type: multipart/mixed; boundary="2.21.0"

This is a multi-part message in MIME format.
--2.21.0
Content-Type: text/plain; charset=UTF-8; format=fixed
Content-Transfer-Encoding: 8bit


- Output routines now do more digits per iteration, and
- Code determines the number of decimal digits in int4/int8 efficiently
- Split off pg_ltoa_n from pg_ltoa
- Use same to make other functions shorter

diff --git a/src/backend/access/common/printsimple.c b/src/backend/access/common/printsimple.c
index 651ade14dd..5c5b6d33b2 100644
--- a/src/backend/access/common/printsimple.c
+++ b/src/backend/access/common/printsimple.c
@@ -112,7 +112,7 @@ printsimple(TupleTableSlot *slot, DestReceiver *self)
 			case INT8OID:
 {
 	int64		num = DatumGetInt64(value);
-	char		str[23];	/* sign, 21 digits and '\0' */
+	char		str[MAXINT8LEN + 1];
 
 	pg_lltoa(num, str);
 	pq_sendcountedtext(&buf, str, strlen(str), false);
diff --git a/src/backend/utils/adt/int8.c b/src/backend/utils/adt/int8.c
index 0ff9394a2f..6230807906 100644
--- a/src/backend/utils/adt/int8.c
+++ b/src/backend/utils/adt/int8.c
@@ -27,8 +27,6 @@
 #include "utils/builtins.h"
 
 
-#define MAXINT8LEN		25
-
 typedef struct
 {
 	int64		current;
diff --git a/src/backend/utils/adt/numutils.c b/src/backend/utils/adt/numutils.c
index 70138feb29..630aafd168 100644
--- a/src/backend/utils/adt/numutils.c
+++ b/src/backend/utils/adt/numutils.c
@@ -20,6 +20,64 @@
 
 #include "common/int.h"
 #include "utils/builtins.h"
+#include "port/pg_bitutils.h"
+
+/*
+ * A table of all two-digit numbers. This is used to speed up decimal digit
+ * generation by copying pairs of digits into the final output.
+ */
+static const char DIGIT_TABLE[200] =
+"00" "01" "02" "03" "04" "05" "06" "07" "08" "09"
+"10" "11" "12" "13" "14" "15" "16" "17" "18" "19"
+"20" "21" "22" "23" "24" "25" "26" "27" "28" "29"
+"30" "31" "32" "33" "34" "35" "36" "37" "38" "39"
+"40" "41" "42" "43" "44" "45" "46" "47" "48" "49"
+"50" "51" "52" "53" "54" "55" "56" "57" "58" "59"
+"60" "61" "62" "63" "64" "65" "66" "67" "68" "69"
+"70" "71" "72" "73" "74" "75" "76" "77" "78" "79"
+"80" "81" "82" "83" "84" "85" "86" "87" "88" "89"
+"90" "91" "92" "93" "94" "95" "96" "97" "98" "99";
+
+/*
+ * Adapted from http://graphics.stanford.edu/~seander/bithacks.html#IntegerLog10
+ */
+static inline uint32
+decimalLength32(const uint32 v)
+{
+	uint32			t;
+	static uint64	PowersOfTen[] =
+	{1,10,100,
+	 1000, 10000, 100000,
+	 1000000,  10000000,  100000000,
+	 1000000000};
+	/*
+	 * Compute base-10 logarithm by dividing the base-2 logarithm
+	 * by a good-enough approximation of the base-2 logarithm of 10
+	 */
+	t = (pg_leftmost_one_pos32(v) + 1)*1233/4096;
+	return t + (v >= PowersOfTen[t]);
+}
+
+static inline uint32
+decimalLength64(const uint64 v)
+{
+	uint32			t;
+	static uint64	PowersOfTen[] =
+	{1,10,100,
+	 1000, 10000, 100000,
+	 1000000,  10000000,  100000000,
+	 1000000000,   10000000000,   100000000000,
+	 10

Re: subscriptionCheck failures on nightjar

2019-09-20 Thread Andres Freund
Hi,

On 2019-09-20 16:25:21 -0400, Tom Lane wrote:
> Andres Freund  writes:
> > Since now a number of people (I tried as well), failed to reproduce this
> > locally, I propose that we increase the log-level during this test on
> > master. And perhaps expand the set of debugging information. With the
> > hope that the additional information on the cases encountered on the bf
> > helps us build a reproducer or, even better, diagnose the issue
> > directly.  If people agree, I'll come up with a patch.
> 
> I recreated my freebsd-9-under-qemu setup and I can still reproduce
> the problem, though not with high reliability (order of 1 time in 10).
> Anything particular you want logged?

A DEBUG2 log would help a fair bit, because it'd log some information
about what changes the "horizons" determining when data may be removed.

Perhaps with the additional elogs attached? I lowered some messages to
DEBUG2 so we don't have to suffer the noise of the ipc.c DEBUG3
messages.

If I use a TEMP_CONFIG setting log_min_messages=DEBUG2 with the patches
applied, the subscription tests still pass.

I hope they still fail on your setup, even though the increased logging
volume probably changes timing somewhat.

Greetings,

Andres Freund
diff --git i/src/backend/access/transam/xlog.c w/src/backend/access/transam/xlog.c
index b7ff004234a..0d48ef3f039 100644
--- i/src/backend/access/transam/xlog.c
+++ w/src/backend/access/transam/xlog.c
@@ -2676,9 +2676,16 @@ XLogSetAsyncXactLSN(XLogRecPtr asyncXactLSN)
 void
 XLogSetReplicationSlotMinimumLSN(XLogRecPtr lsn)
 {
+	XLogRecPtr old_lsn;
+
 	SpinLockAcquire(&XLogCtl->info_lck);
+	old_lsn = XLogCtl->replicationSlotMinLSN;
 	XLogCtl->replicationSlotMinLSN = lsn;
 	SpinLockRelease(&XLogCtl->info_lck);
+
+	elog(DEBUG2, "updating slot minimum lsn from %X/%X to %X/%X",
+		 (uint32) (old_lsn >> 32), (uint32) old_lsn,
+		 (uint32) (lsn >> 32), (uint32) lsn);
 }
 
 
diff --git i/src/backend/replication/logical/snapbuild.c w/src/backend/replication/logical/snapbuild.c
index 0bd1d0f9545..f7d270340de 100644
--- i/src/backend/replication/logical/snapbuild.c
+++ w/src/backend/replication/logical/snapbuild.c
@@ -913,7 +913,7 @@ SnapBuildPurgeCommittedTxn(SnapBuild *builder)
 	memcpy(builder->committed.xip, workspace,
 		   surviving_xids * sizeof(TransactionId));
 
-	elog(DEBUG3, "purged committed transactions from %u to %u, xmin: %u, xmax: %u",
+	elog(DEBUG2, "purged committed transactions from %u to %u, xmin: %u, xmax: %u",
 		 (uint32) builder->committed.xcnt, (uint32) surviving_xids,
 		 builder->xmin, builder->xmax);
 	builder->committed.xcnt = surviving_xids;
@@ -1140,7 +1140,7 @@ SnapBuildProcessRunningXacts(SnapBuild *builder, XLogRecPtr lsn, xl_running_xact
 	xmin = ReorderBufferGetOldestXmin(builder->reorder);
 	if (xmin == InvalidTransactionId)
 		xmin = running->oldestRunningXid;
-	elog(DEBUG3, "xmin: %u, xmax: %u, oldest running: %u, oldest xmin: %u",
+	elog(DEBUG2, "xmin: %u, xmax: %u, oldest running: %u, oldest xmin: %u",
 		 builder->xmin, builder->xmax, running->oldestRunningXid, xmin);
 	LogicalIncreaseXminForSlot(lsn, xmin);
 
@@ -1952,6 +1952,10 @@ CheckPointSnapBuild(void)
 	/* now check for the restart ptrs from existing slots */
 	cutoff = ReplicationSlotsComputeLogicalRestartLSN();
 
+	elog(DEBUG2, "doing snapbuild checkpoint with restart lsn %X/%X, redo %X/%X",
+		 (uint32) (cutoff >> 32), (uint32) cutoff,
+		 (uint32) (redo >> 32), (uint32) redo);
+
 	/* don't start earlier than the restart lsn */
 	if (redo < cutoff)
 		cutoff = redo;
diff --git i/src/backend/storage/ipc/procarray.c w/src/backend/storage/ipc/procarray.c
index 8abcfdf841f..9e8a93e5962 100644
--- i/src/backend/storage/ipc/procarray.c
+++ w/src/backend/storage/ipc/procarray.c
@@ -2981,16 +2981,24 @@ void
 ProcArraySetReplicationSlotXmin(TransactionId xmin, TransactionId catalog_xmin,
 bool already_locked)
 {
+	TransactionId old_xmin;
+	TransactionId old_catalog_xmin;
+
 	Assert(!already_locked || LWLockHeldByMe(ProcArrayLock));
 
 	if (!already_locked)
 		LWLockAcquire(ProcArrayLock, LW_EXCLUSIVE);
 
+	old_xmin = procArray->replication_slot_xmin;
+	old_catalog_xmin = procArray->replication_slot_catalog_xmin;
 	procArray->replication_slot_xmin = xmin;
 	procArray->replication_slot_catalog_xmin = catalog_xmin;
 
 	if (!already_locked)
 		LWLockRelease(ProcArrayLock);
+
+	elog(DEBUG2, "updating slot xmin/catalog_xmin from %u/%u to %u/%u",
+		 old_xmin, old_catalog_xmin, xmin, catalog_xmin);
 }
 
 /*


Re: subscriptionCheck failures on nightjar

2019-09-20 Thread Tom Lane
Andres Freund  writes:
> On 2019-09-20 16:25:21 -0400, Tom Lane wrote:
>> I recreated my freebsd-9-under-qemu setup and I can still reproduce
>> the problem, though not with high reliability (order of 1 time in 10).
>> Anything particular you want logged?

> A DEBUG2 log would help a fair bit, because it'd log some information
> about what changes the "horizons" determining when data may be removed.

Actually, what I did was as attached [1], and I am getting traces like
[2].  The problem seems to occur only when there are two or three
processes concurrently creating the same snapshot file.  It's not
obvious from the debug trace, but the snapshot file *does* exist
after the music stops.

It is very hard to look at this trace and conclude anything other
than "rename(2) is broken, it's not atomic".  Nothing in our code
has deleted the file: no checkpoint has started, nor do we see
the DEBUG1 output that CheckPointSnapBuild ought to produce.
But fsync_fname momentarily can't see it (and then later another
process does see it).

It is now apparent why we're only seeing this on specific ancient
platforms.  I looked around for info about rename(2) not being
atomic, and I found this info about FreeBSD:

https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=94849

The reported symptom there isn't quite the same, so probably there
is another issue, but there is plenty of reason to be suspicious
that UFS rename(2) is buggy in this release.  As for dromedary's
ancient version of macOS, Apple is exceedingly untransparent about
their bugs, but I found

http://www.weirdnet.nl/apple/rename.html

In short, what we got here is OS bugs that have probably been
resolved years ago.

The question is what to do next.  Should we just retire these
specific buildfarm critters, or do we want to push ahead with
getting rid of the PANIC here?

regards, tom lane




Re: subscriptionCheck failures on nightjar

2019-09-20 Thread Tom Lane
Sigh, forgot about attaching the attachments ...

regards, tom lane

diff --git a/src/backend/replication/logical/snapbuild.c b/src/backend/replication/logical/snapbuild.c
index 0bd1d0f..53fd33c 100644
--- a/src/backend/replication/logical/snapbuild.c
+++ b/src/backend/replication/logical/snapbuild.c
@@ -1670,11 +1670,14 @@ SnapBuildSerialize(SnapBuild *builder, XLogRecPtr lsn)
  errmsg("could not rename file \"%s\" to \"%s\": %m",
 		tmppath, path)));
 	}
+	elog(DEBUG1, "renamed serialized snapshot %s to %s", tmppath, path);
 
 	/* make sure we persist */
 	fsync_fname(path, false);
 	fsync_fname("pg_logical/snapshots", true);
 
+	elog(DEBUG1, "fsynced %s", path);
+
 	/*
 	 * Now there's no way we can loose the dumped state anymore, remember this
 	 * as a serialization point.
diff --git a/src/test/subscription/t/010_truncate.pl b/src/test/subscription/t/010_truncate.pl
index be2c0bd..2986582 100644
--- a/src/test/subscription/t/010_truncate.pl
+++ b/src/test/subscription/t/010_truncate.pl
@@ -9,6 +9,11 @@ use Test::More tests => 9;
 
 my $node_publisher = get_new_node('publisher');
 $node_publisher->init(allows_streaming => 'logical');
+$node_publisher->append_conf('postgresql.conf', q{
+log_checkpoints = on
+log_min_messages = 'debug1'
+log_error_verbosity = verbose
+});
 $node_publisher->start;
 
 my $node_subscriber = get_new_node('subscriber');
2019-09-20 17:08:54.361 EDT [34418] DEBUG:  0: registering background 
worker "logical replication launcher"
2019-09-20 17:08:54.361 EDT [34418] LOCATION:  RegisterBackgroundWorker, 
bgworker.c:855
2019-09-20 17:08:54.362 EDT [34418] LOG:  0: starting PostgreSQL 13devel on 
x86_64-unknown-freebsd9.0, compiled by gcc (GCC) 4.2.1 20070831 patched 
[FreeBSD], 64-bit
2019-09-20 17:08:54.362 EDT [34418] LOCATION:  PostmasterMain, postmaster.c:1104
2019-09-20 17:08:54.362 EDT [34418] LOG:  0: listening on Unix socket 
"/tmp/2lehMLoBNn/.s.PGSQL.65366"
2019-09-20 17:08:54.362 EDT [34418] LOCATION:  StreamServerPort, pqcomm.c:587
2019-09-20 17:08:54.363 EDT [34419] LOG:  0: database system was shut down 
at 2019-09-20 17:08:54 EDT
2019-09-20 17:08:54.363 EDT [34419] LOCATION:  StartupXLOG, xlog.c:6241
2019-09-20 17:08:54.363 EDT [34419] DEBUG:  0: checkpoint record is at 
0/15D07F0
2019-09-20 17:08:54.363 EDT [34419] LOCATION:  StartupXLOG, xlog.c:6531
2019-09-20 17:08:54.363 EDT [34419] DEBUG:  0: redo record is at 0/15D07F0; 
shutdown true
2019-09-20 17:08:54.363 EDT [34419] LOCATION:  StartupXLOG, xlog.c:6609
2019-09-20 17:08:54.363 EDT [34419] DEBUG:  0: next transaction ID: 490; 
next OID: 12674
2019-09-20 17:08:54.363 EDT [34419] LOCATION:  StartupXLOG, xlog.c:6613
2019-09-20 17:08:54.363 EDT [34419] DEBUG:  0: next MultiXactId: 1; next 
MultiXactOffset: 0
2019-09-20 17:08:54.363 EDT [34419] LOCATION:  StartupXLOG, xlog.c:6616
2019-09-20 17:08:54.363 EDT [34419] DEBUG:  0: oldest unfrozen transaction 
ID: 483, in database 1
2019-09-20 17:08:54.363 EDT [34419] LOCATION:  StartupXLOG, xlog.c:6619
2019-09-20 17:08:54.363 EDT [34419] DEBUG:  0: oldest MultiXactId: 1, in 
database 1
2019-09-20 17:08:54.363 EDT [34419] LOCATION:  StartupXLOG, xlog.c:6622
2019-09-20 17:08:54.363 EDT [34419] DEBUG:  0: commit timestamp Xid 
oldest/newest: 0/0
2019-09-20 17:08:54.363 EDT [34419] LOCATION:  StartupXLOG, xlog.c:6626
2019-09-20 17:08:54.363 EDT [34419] DEBUG:  0: transaction ID wrap limit is 
2147484130, limited by database with OID 1
2019-09-20 17:08:54.363 EDT [34419] LOCATION:  SetTransactionIdLimit, 
varsup.c:410
2019-09-20 17:08:54.363 EDT [34419] DEBUG:  0: MultiXactId wrap limit is 
2147483648, limited by database with OID 1
2019-09-20 17:08:54.363 EDT [34419] LOCATION:  SetMultiXactIdLimit, 
multixact.c:2271
2019-09-20 17:08:54.363 EDT [34419] DEBUG:  0: starting up replication slots
2019-09-20 17:08:54.363 EDT [34419] LOCATION:  StartupReplicationSlots, 
slot.c:1114
2019-09-20 17:08:54.363 EDT [34419] DEBUG:  0: MultiXactId wrap limit is 
2147483648, limited by database with OID 1
2019-09-20 17:08:54.363 EDT [34419] LOCATION:  SetMultiXactIdLimit, 
multixact.c:2271
2019-09-20 17:08:54.363 EDT [34419] DEBUG:  0: MultiXact member stop limit 
is now 4294914944 based on MultiXact 1
2019-09-20 17:08:54.363 EDT [34419] LOCATION:  SetOffsetVacuumLimit, 
multixact.c:2634
2019-09-20 17:08:54.364 EDT [34418] DEBUG:  0: starting background worker 
process "logical replication launcher"
2019-09-20 17:08:54.364 EDT [34418] LOCATION:  do_start_bgworker, 
postmaster.c:5749
2019-09-20 17:08:54.364 EDT [34418] LOG:  0: database system is ready to 
accept connections
2019-09-20 17:08:54.364 EDT [34418] LOCATION:  reaper, postmaster.c:3017
2019-09-20 17:08:54.365 EDT [34423] DEBUG:  0: autovacuum launcher started
2019-09-20 17:08:54.365 EDT [34423] LOCATION:  AutoVacLauncherMain, 
autovacuum.c:441
2019-09-20 17:08:54.365 EDT [34425] DEBUG:  0: logical replication launcher 
s

Re: subscriptionCheck failures on nightjar

2019-09-20 Thread Andres Freund
Hi,

On 2019-09-20 17:49:27 -0400, Tom Lane wrote:
> Andres Freund  writes:
> > On 2019-09-20 16:25:21 -0400, Tom Lane wrote:
> >> I recreated my freebsd-9-under-qemu setup and I can still reproduce
> >> the problem, though not with high reliability (order of 1 time in 10).
> >> Anything particular you want logged?
> 
> > A DEBUG2 log would help a fair bit, because it'd log some information
> > about what changes the "horizons" determining when data may be removed.
> 
> Actually, what I did was as attached [1], and I am getting traces like
> [2].  The problem seems to occur only when there are two or three
> processes concurrently creating the same snapshot file.  It's not
> obvious from the debug trace, but the snapshot file *does* exist
> after the music stops.
> 
> It is very hard to look at this trace and conclude anything other
> than "rename(2) is broken, it's not atomic". Nothing in our code
> has deleted the file: no checkpoint has started, nor do we see
> the DEBUG1 output that CheckPointSnapBuild ought to produce.
> But fsync_fname momentarily can't see it (and then later another
> process does see it).

Yikes. No wonder most of us weren't able to reproduce the
problem, and that staring at our code didn't point to a bug.

Nice catch.


> In short, what we got here is OS bugs that have probably been
> resolved years ago.
> 
> The question is what to do next.  Should we just retire these
> specific buildfarm critters, or do we want to push ahead with
> getting rid of the PANIC here?

Hm. Given that the fsync failing is actually an issue, I'm somewhat
disinclined to remove the PANIC. It's not like only raising an ERROR
actually solves anything, except making the problem even harder to
diagnose? Or that we otherwise are ok, with renames not being atomic?

So I'd be tentatively in favor of either upgrading, replacing the
filesystem (perhaps ZFS isn't buggy in the same way?), or retiring
those animals.

Greetings,

Andres Freund




Re: subscriptionCheck failures on nightjar

2019-09-20 Thread Alvaro Herrera
On 2019-Sep-20, Tom Lane wrote:

> Actually, what I did was as attached [1], and I am getting traces like
> [2].  The problem seems to occur only when there are two or three
> processes concurrently creating the same snapshot file.  It's not
> obvious from the debug trace, but the snapshot file *does* exist
> after the music stops.

Uh .. I didn't think it was possible that we would build the same
snapshot file more than once.  Isn't that a waste of time anyway?  Maybe
we can fix the symptom by just not doing that in the first place?
I don't have a strategy to do that, but seems worth considering before
retiring the bf animals.

-- 
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: subscriptionCheck failures on nightjar

2019-09-20 Thread Andres Freund
Hi, 

On September 20, 2019 3:06:20 PM PDT, Alvaro Herrera  
wrote:
>On 2019-Sep-20, Tom Lane wrote:
>
>> Actually, what I did was as attached [1], and I am getting traces
>like
>> [2].  The problem seems to occur only when there are two or three
>> processes concurrently creating the same snapshot file.  It's not
>> obvious from the debug trace, but the snapshot file *does* exist
>> after the music stops.
>
>Uh .. I didn't think it was possible that we would build the same
>snapshot file more than once.  Isn't that a waste of time anyway? 
>Maybe
>we can fix the symptom by just not doing that in the first place?
>I don't have a strategy to do that, but seems worth considering before
>retiring the bf animals.

We try to avoid it, but the check is racy. Check comments in 
SnapBuildSerialize. We could introduce locking etc to avoid that, but that 
seems overkill, given that we're really just dealing with a broken OS.

Andres
-- 
Sent from my Android device with K-9 Mail. Please excuse my brevity.




Re: subscriptionCheck failures on nightjar

2019-09-20 Thread Tom Lane
Alvaro Herrera  writes:
> Uh .. I didn't think it was possible that we would build the same
> snapshot file more than once.  Isn't that a waste of time anyway?  Maybe
> we can fix the symptom by just not doing that in the first place?
> I don't have a strategy to do that, but seems worth considering before
> retiring the bf animals.

The comment near the head of SnapBuildSerialize says

 * there is an obvious race condition here between the time we stat(2) the
 * file and us writing the file. But we rename the file into place
 * atomically and all files created need to contain the same data anyway,
 * so this is perfectly fine, although a bit of a resource waste. Locking
 * seems like pointless complication.

which seems like a reasonable argument.  Also, this is hardly the only
place where we expect rename(2) to be atomic.  So I tend to agree with
Andres that we should consider OSes with such a bug to be unsupported.

Dromedary is running the last release of macOS that supports 32-bit
hardware, so if we decide to kick that to the curb, I'd either shut
down the box or put some newer Linux or BSD variant on it.

regards, tom lane




Re: WAL recycled despite logical replication slot

2019-09-20 Thread Andres Freund
Hi, 

On September 20, 2019 5:45:34 AM PDT, Jeff Janes  wrote:
>While testing something else (whether "terminating walsender process
>due to
>replication timeout" was happening spuriously), I had logical
>replication
>set up streaming a default pgbench transaction load, with the publisher
>being 13devel-e1c8743 and subscriber being 12BETA4.  Eventually I
>started
>getting errors about requested wal segments being already removed:
>
>10863 sub idle 0 2019-09-19 17:14:58.140 EDT LOG:  starting logical
>decoding for slot "sub"
>10863 sub idle 0 2019-09-19 17:14:58.140 EDT DETAIL:  Streaming
>transactions committing after 79/EB0B17A0, reading WAL from
>79/E70736A0.
>10863 sub idle 58P01 2019-09-19 17:14:58.140 EDT ERROR:  requested WAL
>segment 0000000100000079000000E7 has already been removed
>10863 sub idle 0 2019-09-19 17:14:58.144 EDT LOG:  disconnection:
>session time: 0:00:00.030 user=jjanes database=jjanes host=10.0.2.2
>port=40830
>
>It had been streaming for about 50 minutes before the error showed up,
>and
>it showed right when streaming was restarting after one of the
>replication
>timeouts.
>
>Is there an innocent explanation for this?  I thought logical
>replication
>slots provided an iron-clad guarantee that WAL would be retained until
>it
>was no longer needed.  I am just using pub/sub, none of the lower level
>stuff.

It indeed should. What's the content of
pg_replication_slot for that slot?

Andres
-- 
Sent from my Android device with K-9 Mail. Please excuse my brevity.




Re: WIP: Generic functions for Node types using generated metadata

2019-09-20 Thread Andres Freund
Hi,

On 2019-09-19 22:18:57 -0700, Andres Freund wrote:
> While working on this I evolved the node string format a bit:
> 
> 1) Node types start with their "normal" name, rather than
>uppercase. There seems little point in having such a divergence.
> 
> 2) The node type is followed by the node-type id. That allows to more
>quickly locate the corresponding node metadata (array and one name
>recheck, rather than a binary search). I.e. the node starts with
>"{Scan 18 " rather than "{SCAN " as before.
> 
> 3) Nodes that contain other nodes as sub-types "inline", still emit {}
>for the subtype. There's no functional need for this, but I found the
>output otherwise much harder to read.  E.g. for mergejoin we'd have
>something like
> 
>{MergeJoin 37 :join {Join 35 :plan {Plan ...} :jointype JOIN_INNER ...} 
> :skip_mark_restore true ...}
> 
> 4) As seen in the above example, enums are decoded to their string
>values. I found that makes the output easier to read. Again, not
>functionally required.
> 
> 5) Value nodes aren't emitted without a {Value ...} anymore. I changed
>this when I expanded the WRITE/READ tests, and encountered failures
>because the old encoding is not entirely roundtrip safe
>(e.g. -INT32_MIN will be parsed as a float at raw parse time, but
>after write/read, it'll be parsed as an integer). While that could be
>fixed in other ways (e.g. by emitting a trailing . for all floats), I
>also found it to be clearer this way - Value nodes are otherwise
>indistinguishable from raw strings, raw numbers etc, which is not
>great.
> 
> It'd also be easier to now just change the node format to something else.

E.g. to just use json. Which'd certainly be a lot easier to delve into,
given the amount of tooling (both on the pg SQL level, and for
commandline / editors / etc).  I don't think it'd be any less
efficient. There'd be a few more = signs, but the lexer is smarter /
faster than the one currently in use for the outfuncs format.  And we'd
just reuse pg_parse_json rather than having a dedicated parser.

- Andres




Re: backup manifests

2019-09-20 Thread David Steele
On 9/20/19 2:55 PM, Robert Haas wrote:
> On Fri, Sep 20, 2019 at 11:09 AM David Steele  wrote:
>>
>> It sucks to make that a prereq for this project but the longer we kick
>> that can down the road...
> 
> There are no doubt many patches that would benefit from having more
> backend infrastructure exposed in frontend contexts, and I think we're
> slowly moving in that direction, but I generally do not believe in
> burdening feature patches with major infrastructure improvements.

The hardest part about technical debt is knowing when to incur it.  It
is never a cut-and-dried choice.

>> This talk was good fun.  The largest number of tables we've seen is a
>> few hundred thousand, but that still adds up to more than a million
>> files to backup.
> 
> A quick survey of some of my colleagues turned up a few examples of
> people with 2-4 million files to backup, so similar kind of ballpark.
> Probably not big enough for the manifest to hit the 1GB mark, but
> getting close.

I have so many doubts about clusters with this many tables, but we do
support it, so...

>>> I hear you saying that this is going to end up being just as complex
>>> in the end, but I don't think I believe it.  It sounds to me like the
>>> difference between spending a couple of hours figuring this out and
>>> spending a couple of months trying to figure it out and maybe not
>>> actually getting anywhere.
>>
>> Maybe the initial implementation will be easier but I am confident we'll
>> pay for it down the road.  Also, don't we want users to be able to read
>> this file?  Do we really want them to need to cook up a custom parser in
>> Perl, Go, Python, etc.?
> 
> Well, I haven't heard anybody complain that they can't read a
> backup_label file because it's too hard to cook up a parser.  And I
> think the reason is pretty clear: such files are not hard to parse.
> Similarly for a pg_hba.conf file.  This case is a little more
> complicated than those, but AFAICS, not enormously so. Actually, it
> seems like a combination of those two cases: it has some fixed
> metadata fields that can be represented with one line per field, like
> a backup_label, and then a bunch of entries for files that are
> somewhat like entries in a pg_hba.conf file, in that they can be
> represented by a line per record with a certain number of fields on
> each line.

Yeah, they are not hard to parse, but *everyone* has to cook up code for
it.  A bit of a bummer, that.

> I attach here a couple of patches.  The first one does some
> refactoring of relevant code in pg_basebackup, and the second one adds
> checksum manifests using a format that I pulled out of my ear. It
> probably needs some adjustment but I don't think it's crazy.  Each
> file gets a line that looks like this:
> 
> File $FILENAME $FILESIZE $FILEMTIME $FILECHECKSUM

We also include page checksum validation failures in the file record.
Not critical for the first pass, perhaps, but something to keep in mind.

> Right now, the file checksums are computed using SHA-256 but it could
> be changed to anything else for which we've got code. On my system,
> shasum -a256 $FILE produces the same answer that shows up here.  At
> the bottom of the manifest there's a checksum of the manifest itself,
> which looks like this:
> 
> Manifest-Checksum
> 385fe156a8c6306db40937d59f46027cc079350ecf5221027d71367675c5f781
> 
> That's a SHA-256 checksum of the file contents excluding the final
> line. It can be verified by feeding all the file contents except the
> last line to shasum -a256. I can't help but observe that if the file
> were defined to be a JSONB blob, it's not very clear how you would
> include a checksum of the blob contents in the blob itself, but with a
> format based on a bunch of lines of data, it's super-easy to generate
> and super-easy to write tools that verify it.

You can do this in JSON pretty easily by handling the terminating
brace/bracket:

{
<file records>*,
"checksum":<manifest checksum>
}

But of course a linefeed-delimited file is even easier.
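
For what it's worth, verifying the trailer scheme described above is
pretty small.  A rough sketch (assuming the whole manifest is in memory,
a "Manifest-Checksum <hex>" final line, and src/common/sha2.h; this is
an illustration, not code from the patch):

#include "postgres_fe.h"
#include "common/sha2.h"

static bool
manifest_checksum_ok(const char *buf, size_t len)
{
	const char *last;
	pg_sha256_ctx ctx;
	uint8		digest[PG_SHA256_DIGEST_LENGTH];
	char		hex[PG_SHA256_DIGEST_LENGTH * 2 + 1];

	/* find the start of the final line */
	last = buf + len;
	while (last > buf && last[-1] == '\n')
		last--;
	while (last > buf && last[-1] != '\n')
		last--;

	if (strncmp(last, "Manifest-Checksum ", 18) != 0)
		return false;

	/* hash everything before the final line */
	pg_sha256_init(&ctx);
	pg_sha256_update(&ctx, (const uint8 *) buf, last - buf);
	pg_sha256_final(&ctx, digest);

	for (int i = 0; i < PG_SHA256_DIGEST_LENGTH; i++)
		sprintf(hex + i * 2, "%02x", digest[i]);

	return strncmp(last + 18, hex, PG_SHA256_DIGEST_LENGTH * 2) == 0;
}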

> This is just a prototype so I haven't written a verification tool, and
> there's a bunch of testing and documentation and so forth that would
> need to be done aside from whatever we've got to hammer out in terms
> of design issues and file formats.  But I think it's cool, and perhaps
> some discussion of how it could be evolved will get us closer to a
> resolution everybody can at least live with.

I had a quick look and it seems pretty reasonable.  I'll need to
generate a manifest to see if I can spot any obvious gotchas.

-- 
-David
da...@pgmasters.net




Re: Wrong results using initcap() with non normalized string

2019-09-20 Thread Tom Lane
=?UTF-8?Q?Juan_Jos=C3=A9_Santamar=C3=ADa_Flecha?=  
writes:
> I have come around a strange situation when using a unicode string
> that has non normalized characters. The attached script 'initcap.sql'
> can reproduce the problem.
> The attached patch can fix the issue.

If we're going to start worrying about non-normalized characters,
I suspect there are far more places than this one that we'd have
to consider buggy :-(.

As for the details of the patch, it seems overly certain that
it's working with UTF8 data.

regards, tom lane




Re: Nondeterministic collations vs. text_pattern_ops

2019-09-20 Thread Tom Lane
Peter Eisentraut  writes:
>> Here is a draft patch.

Where are we on pushing that?  I'm starting to get antsy about the
amount of time remaining before rc1.  It's a low-risk fix, but still,
it'd be best to have a complete buildfarm cycle on it before Monday's
wrap.

regards, tom lane




Re: PGCOLOR? (Re: pgsql: Unified logging system for command-line programs)

2019-09-20 Thread Tom Lane
Peter Eisentraut  writes:
> It looks like there is documentation for PG_COLORS in the release notes
> now, which seems like an odd place.  Suggestions for a better place?

BTW, as far as that goes, it looks like PG_COLOR is documented separately
in each frontend program's "Environment" man page section.  That's a bit
duplicative but I don't think we have a better answer right now.  Seems
like you just need to add boilerplate text about PG_COLORS alongside
each of those.

regards, tom lane




Re: Optimization of some jsonb functions

2019-09-20 Thread Alvaro Herrera
I pushed the first few parts.  The attached is a rebased copy of the
last remaining piece.  However, I didn't quite understand what this was
doing, so I refrained from pushing.  I think there are two patches here:
one that adapts the API of findJsonbValueFromContainer and
getIthJsonbValueFromContainer to take the output result pointer as an
argument, allowing us to save palloc cycles just like the newly added
getKeyJsonValueFromContainer(); and the other changes JsonbDeepContains
so that it uses a new function (which is a function with a weird API
that would be extracted from findJsonbValueFromContainer).

Also, the current patch just passes NULL into the routines from
jsonpath_exec.c but I think it would be useful to pass pointers into
stack-allocated result structs instead, at least in getJsonPathVariable.

Since the majority of this patchset got pushed, I'll leave this for
Nikita to handle for the next commitfest if he wants to, and mark this
CF entry as committed.

Thanks!

-- 
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
diff --git a/src/backend/utils/adt/jsonb_op.c b/src/backend/utils/adt/jsonb_op.c
index a64206eeb1..82c4b0b2cb 100644
--- a/src/backend/utils/adt/jsonb_op.c
+++ b/src/backend/utils/adt/jsonb_op.c
@@ -24,6 +24,7 @@ jsonb_exists(PG_FUNCTION_ARGS)
 	Jsonb	   *jb = PG_GETARG_JSONB_P(0);
 	text	   *key = PG_GETARG_TEXT_PP(1);
 	JsonbValue	kval;
+	JsonbValue	vval;
 	JsonbValue *v = NULL;
 
 	/*
@@ -38,7 +39,7 @@ jsonb_exists(PG_FUNCTION_ARGS)
 
 	v = findJsonbValueFromContainer(&jb->root,
 	JB_FOBJECT | JB_FARRAY,
-	&kval);
+	&kval, &vval);
 
 	PG_RETURN_BOOL(v != NULL);
 }
@@ -59,6 +60,7 @@ jsonb_exists_any(PG_FUNCTION_ARGS)
 	for (i = 0; i < elem_count; i++)
 	{
 		JsonbValue	strVal;
+		JsonbValue	valVal;
 
 		if (key_nulls[i])
 			continue;
@@ -69,7 +71,7 @@ jsonb_exists_any(PG_FUNCTION_ARGS)
 
 		if (findJsonbValueFromContainer(&jb->root,
 		JB_FOBJECT | JB_FARRAY,
-		&strVal) != NULL)
+		&strVal, &valVal) != NULL)
 			PG_RETURN_BOOL(true);
 	}
 
@@ -92,6 +94,7 @@ jsonb_exists_all(PG_FUNCTION_ARGS)
 	for (i = 0; i < elem_count; i++)
 	{
 		JsonbValue	strVal;
+		JsonbValue	valVal;
 
 		if (key_nulls[i])
 			continue;
@@ -102,7 +105,7 @@ jsonb_exists_all(PG_FUNCTION_ARGS)
 
 		if (findJsonbValueFromContainer(&jb->root,
 		JB_FOBJECT | JB_FARRAY,
-		&strVal) == NULL)
+		&strVal, &valVal) == NULL)
 			PG_RETURN_BOOL(false);
 	}
 
diff --git a/src/backend/utils/adt/jsonb_util.c b/src/backend/utils/adt/jsonb_util.c
index 7e0d9de7f0..7ec33cebf2 100644
--- a/src/backend/utils/adt/jsonb_util.c
+++ b/src/backend/utils/adt/jsonb_util.c
@@ -33,6 +33,8 @@
 #define JSONB_MAX_ELEMS (Min(MaxAllocSize / sizeof(JsonbValue), JB_CMASK))
 #define JSONB_MAX_PAIRS (Min(MaxAllocSize / sizeof(JsonbPair), JB_CMASK))
 
+static JsonbValue *findJsonbElementInArray(JsonbContainer *container,
+		   JsonbValue *elem, JsonbValue *res);
 static void fillJsonbValue(JsonbContainer *container, int index,
 		   char *base_addr, uint32 offset,
 		   JsonbValue *result);
@@ -327,38 +329,19 @@ compareJsonbContainers(JsonbContainer *a, JsonbContainer *b)
  */
 JsonbValue *
 findJsonbValueFromContainer(JsonbContainer *container, uint32 flags,
-			JsonbValue *key)
+			JsonbValue *key, JsonbValue *res)
 {
-	JEntry	   *children = container->children;
 	int			count = JsonContainerSize(container);
 
 	Assert((flags & ~(JB_FARRAY | JB_FOBJECT)) == 0);
 
-	/* Quick out without a palloc cycle if object/array is empty */
+	/* Quick out if object/array is empty */
 	if (count <= 0)
 		return NULL;
 
 	if ((flags & JB_FARRAY) && JsonContainerIsArray(container))
 	{
-		JsonbValue *result = palloc(sizeof(JsonbValue));
-		char	   *base_addr = (char *) (children + count);
-		uint32		offset = 0;
-		int			i;
-
-		for (i = 0; i < count; i++)
-		{
-			fillJsonbValue(container, i, base_addr, offset, result);
-
-			if (key->type == result->type)
-			{
-if (equalsJsonbScalarValue(key, result))
-	return result;
-			}
-
-			JBE_ADVANCE_OFFSET(offset, children[i]);
-		}
-
-		pfree(result);
+		return findJsonbElementInArray(container, key, res);
 	}
 	else if ((flags & JB_FOBJECT) && JsonContainerIsObject(container))
 	{
@@ -366,13 +349,48 @@ findJsonbValueFromContainer(JsonbContainer *container, uint32 flags,
 		Assert(key->type == jbvString);
 
 		return getKeyJsonValueFromContainer(container, key->val.string.val,
-			key->val.string.len, NULL);
+			key->val.string.len, res);
 	}
 
 	/* Not found */
 	return NULL;
 }
 
+/*
+ * Subroutine for findJsonbValueFromContainer
+ *
+ * Find scalar element in Jsonb array and return it.
+ */
+static JsonbValue *
+findJsonbElementInArray(JsonbContainer *container, JsonbValue *elem,
+		JsonbValue *res)
+{
+	JsonbValue *result;
+	JEntry	   *children = container->children;
+	int			count = JsonContainerSize(container)

Re: Wrong results using initcap() with non normalized string

2019-09-20 Thread Alvaro Herrera
On 2019-Sep-20, Tom Lane wrote:

> =?UTF-8?Q?Juan_Jos=C3=A9_Santamar=C3=ADa_Flecha?= 
>  writes:
> > I have come around a strange situation when using a unicode string
> > that has non normalized characters. The attached script 'initcap.sql'
> > can reproduce the problem.

For illustration purposes:

SELECT initcap('ŞUB');
 initcap 
─
 Şub
(1 fila)

SELECT initcap('ŞUB');
 initcap 
─
 ŞUb
(1 fila)

> If we're going to start worrying about non-normalized characters,
> I suspect there are far more places than this one that we'd have
> to consider buggy :-(.

I would think that we have to start somewhere, rather than take the
position that we can never do anything about it.

(ref: 
https://www.postgresql.org/message-id/flat/53E179E1.3060404%402ndquadrant.com )

-- 
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: subscriptionCheck failures on nightjar

2019-09-20 Thread Michael Paquier
On Fri, Sep 20, 2019 at 05:30:48PM +0200, Tomas Vondra wrote:
> But even with that change you haven't managed to reproduce the issue,
> right? Or am I misunderstanding?

No, I was not able to see it on my laptop running Debian.
--
Michael


signature.asc
Description: PGP signature


Re: Add "password_protocol" connection parameter to libpq

2019-09-20 Thread Michael Paquier
On Fri, Sep 20, 2019 at 10:57:04AM -0700, Jeff Davis wrote:
> Thank you, applied.

Okay, I can see which parts you have changed.

> * I also changed the comment above pg_fe_scram_channel_bound() to
> clarify that the caller must also check that the exchange was
> successful.
> 
> * I changed the error message when AUTH_REQ_OK is received without
> channel binding. It seemed confusing before. I also added a test.

And both make sense.

+ * Return true if channel binding was employed and the scram exchange
upper('scram')?

Except for this nit, it looks good to me.
--
Michael


signature.asc
Description: PGP signature


Re: Efficient output for integer types

2019-09-20 Thread Andrew Gierth
> "David" == David Fetter  writes:

 David> +   /* Compute the result string. */
 David> +   if (value >= 100000000)
 David> +   {
 David> +   const   uint32 value2 = value % 100000000;
 David> +
 David> +   const uint32 c = value2 % 10000;
 David> +   const uint32 d = value2 / 10000;
 David> +   const uint32 c0 = (c % 100) << 1;
 David> +   const uint32 c1 = (c / 100) << 1;
 David> +   const uint32 d0 = (d % 100) << 1;
 David> +   const uint32 d1 = (d / 100) << 1;
 David> +
 David> +   char *pos = a + olength - i;
 David> +
 David> +   value /= 100000000;
 David> +
 David> +   memcpy(pos - 2, DIGIT_TABLE + c0, 2);
 David> +   memcpy(pos - 4, DIGIT_TABLE + c1, 2);
 David> +   memcpy(pos - 6, DIGIT_TABLE + d0, 2);
 David> +   memcpy(pos - 8, DIGIT_TABLE + d1, 2);
 David> +   i += 8;
 David> +   }

For the 32-bit case, there's no point in doing an 8-digit divide
specially, it doesn't save any time. It's sufficient to just change

 David> +   if (value >= 100000000)

to  while(value >= 10000)

in order to process 4 digits at a time.
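
(For illustration, such a loop could look roughly like the sketch below,
which assumes the DIGIT_TABLE from the patch and a "pos" pointer just
past the end of the digit area; it is not the patch text itself.)

static char *
write_digits_backward(uint32 value, char *pos)
{
	/* four digits per iteration via the two-digit lookup table */
	while (value >= 10000)
	{
		const uint32 c = value % 10000;
		const uint32 c0 = (c % 100) << 1;
		const uint32 c1 = (c / 100) << 1;

		value /= 10000;
		pos -= 4;
		memcpy(pos + 2, DIGIT_TABLE + c0, 2);
		memcpy(pos, DIGIT_TABLE + c1, 2);
	}
	/* at most four digits remain here */
	if (value >= 100)
	{
		const uint32 c = (value % 100) << 1;

		value /= 100;
		pos -= 2;
		memcpy(pos, DIGIT_TABLE + c, 2);
	}
	if (value >= 10)
	{
		pos -= 2;
		memcpy(pos, DIGIT_TABLE + (value << 1), 2);
	}
	else
		*--pos = (char) ('0' + value);
	return pos;
}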

 David> +   for(int i = 0; i < minwidth - len; i++)
 David> +   {
 David> +   memcpy(str + i, DIGIT_TABLE, 1);
 David> +   }

Should be:
memset(str, '0', minwidth-len);

-- 
Andrew (irc:RhodiumToad)




Re: pgbench - allow to create partitioned tables

2019-09-20 Thread Amit Kapila
On Sat, Sep 21, 2019 at 12:26 AM Fabien COELHO  wrote:
>
> >> I would not bother to create a patch for so small an improvement. This
> >> makes sense in passing because the created function makes it very easy,
> >> but otherwise I'll just drop it.
> >
> > I would prefer to drop for now.
>
> Attached v13 does that, I added a comment instead. I do not think that it
> is an improvement.
>
> > + else
> > + {
> > + fprintf(stderr, "unexpected partition method: \"%s\"\n", ps);
> > + exit(1);
> > + }
> >
> > If we can't catch that earlier, then it might be better to have some
> > version-specific checks rather than such obscure code which is
> > difficult to understand for others.
>
> Hmmm. The code simply checks for the current partitioning and fails if the
> result is unknown, which I understood was what you asked; the previous
> version was just ignoring the result.
>

Yes, this code is correct.  I am not sure if you understood the point,
so let me try again. I am bothered about the below code in the patch:
+ /* only print partitioning information if some partitioning was detected */
+ if (partition_method != PART_NONE)

This is the only place now where we check 'whether there are any
partitions' differently.  I am suggesting to make this similar to
other checks (if (partitions > 0)).

> The likelihood of postgres dropping support for range or hash partitions
> seems unlikely.
>
> This issue would rather be raised if an older partition-enabled pgbench is run
> against a newer postgres which adds a new partition method. But then I
> cannot guess when a new partition method will be added, so I cannot put a
> guard with a version about something in the future. Possibly, if no new
> method is ever added, the code will never be triggered.
>

Sure, even in that case your older version of pgbench will be able to
detect it via the below code:
+ else
+ {
+ fprintf(stderr, "unexpected partition method: \"%s\"\n", ps);
+ exit(1);
+ }

>
> > * improve the comments around query to fetch partitions
>
> What? How?
>
> There are already quite a few comments compared to the length of the
> query.
>

Hmm, you have just written what each part of the query is doing, which
I think one can identify if we write some general comment, as I have in
the patch, to explain the overall intent.  Even if we write what each
part of the statement is doing, the comment explaining overall intent
is required.  I personally don't like writing a comment for each
sub-part of the query as that makes reading the query difficult.  See
the patch sent by me in my previous email.

> > * improve the comments in the patch and make the code look like nearby
> > code
>
> This requirement is to fuzzy. I re-read the changes, and both code and
> comments look okay to me.
>

I have done that in some of the cases in the patch attached by me in
my last email.  Have you looked at those changes?  Try to make those
changes in the next version unless you see something wrong written
in the comments.

> > * pgindent the patch
>
> Done.
>
> > I think we should try to add some note or comment about why we only
> > choose to partition pgbench_accounts table when the user has given
> > --partitions option.
>
> Added as a comment on the initPartition function.
>

I am not sure if something like that is required in the docs, but we
can leave it for now.

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com




Re: Don't codegen deform code for virtual tuples in expr eval for scan fetch

2019-09-20 Thread Soumyadeep Chakraborty
Hello,

In my previous patch 0001, the resulting opblock consisted of a single
br instruction to its successor opblock. Such a block represents
unnecessary overhead. Even though such a block would be optimized
away, what if optimization is not performed (perhaps due to
jit_optimize_above_cost)? Perhaps we could be more aggressive. We
could maybe remove the opblock altogether. However, such a solution
is not without complexity.

Such unlinking is non-trivial, and is one of the things the
simplify-cfg pass takes care of. One solution could be to run this
pass ad-hoc for the evalexpr function. Or we could perform the
unlinking during codegen. Such a solution is presented below.

To unlink the current opblock from
the cfg we have to replace all of its uses. From the current state of
the code, these uses are either:
1. Terminator instruction of opblocks[i - 1]
2. Terminator instruction of some sub-op-block of opblocks[i - 1]. By
sub-op-block, I mean some block that is directly linked from opblock[i
- 1] but not opblocks[i].
3. Terminator instruction of the entry block.

We should replace all of these uses with opblocks[i + 1], and only then
can we delete opblocks[i]. Patch v2-0001 in my latest patch set
achieves this.
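
The core of it is small; roughly (an illustrative helper, not the patch
itself, using the opblocks[] naming from llvmjit_expr.c):

#include <llvm-c/Core.h>

/*
 * All remaining uses of "dead" are branch targets, so redirecting them
 * to "successor" leaves the block unreachable and safe to delete.
 */
static void
unlink_opblock(LLVMBasicBlockRef dead, LLVMBasicBlockRef successor)
{
	LLVMReplaceAllUsesWith(LLVMBasicBlockAsValue(dead),
						   LLVMBasicBlockAsValue(successor));
	LLVMDeleteBasicBlock(dead);
}

i.e. called as unlink_opblock(opblocks[i], opblocks[i + 1]) once we know
opblocks[i] holds nothing but the unconditional br.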

I guard LLVMReplaceAllUsesWith() with Assert()s to ensure that we
don't accidentally replace non-terminator uses of opblocks[i], should
they be introduced in the future. If these asserts fail in the future,
replacing these non-terminator instructions with undefs constitutes a
commonly adopted solution. Otherwise, we can always fall back to making
opblocks[i] empty with just the unconditional br, as in my previous
patch 0001.

--
Soumyadeep

On Tue, Sep 17, 2019 at 11:54 PM Soumyadeep Chakraborty <
soumyadeep2...@gmail.com> wrote:

> Hello Hackers,
>
> This is to address a TODO I found in the JIT expression evaluation
> code (opcode =
> EEOP_INNER_FETCHSOME/EEOP_OUTER_FETCHSOME/EEOP_SCAN_FETCHSOME):
>
> * TODO: skip nvalid check if slot is fixed and known to
> * be a virtual slot.
>
> Not only should we skip the nvalid check if the tuple is virtual, the
> whole basic block should be a no-op. There is no deforming to be done,
> JIT (slot_compile_deform) or otherwise (slot_getsomeattrs) for virtual
> tuples.
>
> Attached is a patch 0001 that achieves exactly that by:
>
> 1. Performing the check at the very beginning of the case.
> 2. Emitting an unconditional branch to the next opblock if we have a
> virtual tuple. We end up with a block with just the single
> unconditional branch instruction in this case. This block is optimized
> away when we run llvm_optimize_module().
>
> Also enclosed is another patch 0002 that adds on some minor
> refactoring of conditionals around whether to JIT deform or not.
>
> ---
> Soumyadeep Chakraborty
>
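
To make the quoted approach concrete, the early exit in the
EEOP_*_FETCHSOME case of llvm_compile_expr() would look roughly like
the sketch below; the field names (op->d.fetch.fixed, op->d.fetch.kind)
and the variables b, opblocks and i are assumptions based on the
existing code, not a quote of the patch:

	/* sketch only: skip deforming entirely for fixed virtual slots */
	const TupleTableSlotOps *tts_ops =
		op->d.fetch.fixed ? op->d.fetch.kind : NULL;

	if (tts_ops == &TTSOpsVirtual)
	{
		/* nothing to deform; fall through to the next opcode's block */
		LLVMBuildBr(b, opblocks[i + 1]);
		break;
	}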


v2-0001-Don-t-codegen-if-tuple-is-virtual-in-expr-eval-of.patch
Description: Binary data


v2-0002-Minor-refactor-JIT-deform-or-not.patch
Description: Binary data


Re: Efficient output for integer types

2019-09-20 Thread David Fetter
On Sat, Sep 21, 2019 at 03:36:21AM +0100, Andrew Gierth wrote:
> > "David" == David Fetter  writes:
> 
>  David> + /* Compute the result string. */
>  David> + if (value >= 100000000)
>  David> + {
>  David> + const   uint32 value2 = value % 100000000;
>  David> +
>  David> + const uint32 c = value2 % 10000;
>  David> + const uint32 d = value2 / 10000;
>  David> + const uint32 c0 = (c % 100) << 1;
>  David> + const uint32 c1 = (c / 100) << 1;
>  David> + const uint32 d0 = (d % 100) << 1;
>  David> + const uint32 d1 = (d / 100) << 1;
>  David> +
>  David> + char *pos = a + olength - i;
>  David> +
>  David> + value /= 100000000;
>  David> +
>  David> + memcpy(pos - 2, DIGIT_TABLE + c0, 2);
>  David> + memcpy(pos - 4, DIGIT_TABLE + c1, 2);
>  David> + memcpy(pos - 6, DIGIT_TABLE + d0, 2);
>  David> + memcpy(pos - 8, DIGIT_TABLE + d1, 2);
>  David> + i += 8;
>  David> + }
> 
> For the 32-bit case, there's no point in doing an 8-digit divide
> specially, it doesn't save any time. It's sufficient to just change
> 
>  David> + if (value >= 10000)
> 
> to  while(value >= 10000)

Done.

> in order to process 4 digits at a time.
> 
>  David> + for(int i = 0; i < minwidth - len; i++)
>  David> + {
>  David> + memcpy(str + i, DIGIT_TABLE, 1);
>  David> + }
> 
> Should be:
> memset(str, '0', minwidth-len);

Done.
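
For reference, with both changes applied the conversion loop would
look roughly like this (a sketch following the patch's DIGIT_TABLE,
olength and i conventions, not the literal patch text):

	while (value >= 10000)
	{
		const uint32 c = value % 10000;
		const uint32 c0 = (c % 100) << 1;	/* offsets into DIGIT_TABLE */
		const uint32 c1 = (c / 100) << 1;
		char	   *pos = a + olength - i;

		value /= 10000;

		memcpy(pos - 2, DIGIT_TABLE + c0, 2);	/* low two digits */
		memcpy(pos - 4, DIGIT_TABLE + c1, 2);	/* next two digits */
		i += 4;
	}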

Best,
David.
-- 
David Fetter  http://fetter.org/
Phone: +1 415 235 3778

Remember to vote!
Consider donating to Postgres: http://www.postgresql.org/about/donate
From 4407b0771cd53890acf41ffee8908d1a701c52c0 Mon Sep 17 00:00:00 2001
From: David Fetter 
Date: Sun, 15 Sep 2019 00:06:29 -0700
Subject: [PATCH v10] Make int4 and int8 operations more efficient
To: hackers


- Output routines now do more digits per iteration, and
- Code determines the number of decimal digits in int4/int8 efficiently
- Split off pg_ltoa_n from pg_ltoa
- Use same to make other functions shorter

diff --git a/src/backend/access/common/printsimple.c b/src/backend/access/common/printsimple.c
index 651ade14dd..5c5b6d33b2 100644
--- a/src/backend/access/common/printsimple.c
+++ b/src/backend/access/common/printsimple.c
@@ -112,7 +112,7 @@ printsimple(TupleTableSlot *slot, DestReceiver *self)
 			case INT8OID:
 {
 	int64		num = DatumGetInt64(value);
-	char		str[23];	/* sign, 21 digits and '\0' */
+	char		str[MAXINT8LEN + 1];
 
 	pg_lltoa(num, str);
 	pq_sendcountedtext(&buf, str, strlen(str), false);
diff --git a/src/backend/utils/adt/int8.c b/src/backend/utils/adt/int8.c
index 0ff9394a2f..6230807906 100644
--- a/src/backend/utils/adt/int8.c
+++ b/src/backend/utils/adt/int8.c
@@ -27,8 +27,6 @@
 #include "utils/builtins.h"
 
 
-#define MAXINT8LEN		25
-
 typedef struct
 {
 	int64		current;
diff --git a/src/backend/utils/adt/numutils.c b/src/backend/utils/adt/numutils.c
index 70138feb29..babc6f6166 100644
--- a/src/backend/utils/adt/numutils.c
+++ b/src/backend/utils/adt/numutils.c
@@ -20,6 +20,68 @@
 
 #include "common/int.h"
 #include "utils/builtins.h"
+#include "port/pg_bitutils.h"
+
+/*
+ * A table of all two-digit numbers. This is used to speed up decimal digit
+ * generation by copying pairs of digits into the final output.
+ */
+static const char DIGIT_TABLE[200] =
+"00" "01" "02" "03" "04" "05" "06" "07" "08" "09"
+"10" "11" "12" "13" "14" "15" "16" "17" "18" "19"
+"20" "21" "22" "23" "24" "25" "26" "27" "28" "29"
+"30" "31" "32" "33" "34" "35" "36" "37" "38" "39"
+"40" "41" "42" "43" "44" "45" "46" "47" "48" "49"
+"50" "51" "52" "53" "54" "55" "56" "57" "58" "59"
+"60" "61" "62" "63" "64" "65" "66" "67" "68" "69"
+"70" "71" "72" "73" "74" "75" "76" "77" "78" "79"
+"80" "81" "82" "83" "84" "85" "86" "87" "88" "89"
+"90" "91" "92" "93" "94" "95" "96" "97" "98" "99";
+
+/*
+ * Adapted from http://graphics.stanford.edu/~seander/bithacks.html#IntegerLog10
+ */
+static inline uint32
+decimalLength32(const uint32 v)
+{
+	uint32			t;
+	static uint32	PowersOfTen[] =
+	{1, 10, 100,
+	 1000, 10000, 100000,
+	 1000000, 10000000, 100000000,
+	 1000000000};
+	/*
+	 * Compute base-10 logarithm by dividing the base-2 logarithm
+	 * by a good-enough approximation of the base-2 logarithm of 10
+	 */
+	t = (pg_leftmost_one_pos32(v) + 1)*1233/4096;
+	return t + (v >= PowersOfTen[t]);
+}
+
+static inline uint32
+decimalLength64(const uint64_t v)
+{
+	uint32			t;
+	static uint64_t	PowersOfTen[] = {
+		UINT64CONST(1),   UINT64CONST(10),
+

Re: A problem about partitionwise join

2019-09-20 Thread Dilip Kumar
On Fri, Sep 20, 2019 at 2:33 PM Richard Guo  wrote:
>
> Hi Dilip,
>
> Thank you for reviewing this patch.
>
> On Fri, Sep 20, 2019 at 12:48 PM Dilip Kumar  wrote:
>>
>> On Thu, Aug 29, 2019 at 3:15 PM Richard Guo  wrote:
>> >
>> >
>> > Attached is a patch as an attempt to address this issue. The idea is
>> > quite straightforward. When building partition info for joinrel, we
>> > generate any possible EC-derived joinclauses of form 'outer_em =
>> > inner_em', which will be used together with the original restrictlist to
>> > check if there exists an equi-join condition for each pair of partition
>> > keys.
>> >
>> > Any comments are welcome!
>>  /*
>> + * generate_join_implied_equalities_for_all
>> + *   Create any EC-derived joinclauses of form 'outer_em = inner_em'.
>> + *
>> + * This is used when building partition info for joinrel.
>> + */
>> +List *
>> +generate_join_implied_equalities_for_all(PlannerInfo *root,
>> + Relids join_relids,
>> + Relids outer_relids,
>> + Relids inner_relids)
>>
>> I think we need to have more detailed comments about why we need this
>> separate function, we can also explain that
>> generate_join_implied_equalities function will avoid
>> the join clause if EC has the constant but for partition-wise join, we
>> need that clause too.
>
>
> Thank you for the suggestion.
>
>>
>>
>> + while ((i = bms_next_member(matching_ecs, i)) >= 0)
>> + {
>> + EquivalenceClass *ec = (EquivalenceClass *) list_nth(root->eq_classes, i);
>> + List*outer_members = NIL;
>> + List*inner_members = NIL;
>> + ListCell   *lc1;
>> +
>> + /* Do not consider this EC if it's ec_broken */
>> + if (ec->ec_broken)
>> + continue;
>> +
>> + /* Single-member ECs won't generate any deductions */
>> + if (list_length(ec->ec_members) <= 1)
>> + continue;
>> +
>>
>> I am wondering isn't it possible to just process the missing join
>> clause?  I mean 'generate_join_implied_equalities' has only skipped
>> the ECs which has const so
>> can't we create join clause only for those ECs and append it the
>> "Restrictlist" we already have?  I might be missing something?
>
>
> For ECs without const, 'generate_join_implied_equalities' also has
> skipped some join clauses since it only selects the joinclause with
> 'best_score' between outer members and inner members. And the missing
> join clauses are needed to generate partitionwise join. That's why the
> query below cannot be planned as partitionwise join, as we have missed
> joinclause 'foo.k = bar.k'.

oh right.  I missed that part.

> select * from p as foo join p as bar on foo.k = bar.val and foo.k = bar.k;
>
> And yes 'generate_join_implied_equalities_for_all' will create join
> clauses that have existed in restrictlist. I think it's OK since the
> same RestrictInfo deduced from EC will share the same pointer and
> list_concat_unique_ptr will make sure there are no duplicates in the
> restrictlist used by have_partkey_equi_join.
>
ok
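
For anyone wanting to reproduce this, a minimal setup along these
lines shows the missed partitionwise join (the table definition is
guessed from the quoted query and may differ from Richard's original):

	create table p (k int, val int) partition by range (k);
	create table p1 partition of p for values from (0) to (100);
	create table p2 partition of p for values from (100) to (200);

	set enable_partitionwise_join = on;

	-- The EC {foo.k, bar.k, bar.val} contributes only one clause to the
	-- joinrel's restrictlist, so have_partkey_equi_join does not see
	-- foo.k = bar.k and the join is not planned partitionwise.
	explain (costs off)
	select * from p as foo join p as bar
	  on foo.k = bar.val and foo.k = bar.k;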

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com




Re: Efficient output for integer types

2019-09-20 Thread Andrew Gierth
> "David" == David Fetter  writes:

 David> +static inline uint32
 David> +decimalLength64(const uint64_t v)

Should be uint64, not uint64_t.

Also return an int, not a uint32.

For int vs. int32, my own inclination is to use "int" where the value is
just a (smallish) number, especially one that will be used as an index
or loop count, and use "int32" when it actually matters that it's 32
bits rather than some other size. Other opinions may differ.

 David> +{
 David> +   uint32  t;
 David> +   static uint64_t PowersOfTen[] = {

uint64 not uint64_t here too.

 David> +int32
 David> +pg_ltoa_n(uint32 value, char *a)

If this is going to handle only unsigned values, it should probably be
named pg_ultoa_n.

 David> +   uint32  i = 0, adjust = 0;

"adjust" is not assigned anywhere else. Presumably that's from previous
handling of negative numbers?

 David> +   memcpy(a, "0", 1);

 *a = '0';  would suffice.

 David> +   i += adjust;

Superfluous?

 David> +   uint32_tuvalue = (uint32_t)value;

uint32 not uint32_t.

 David> +   int32   len;

See above re. int vs. int32.

 David> +   uvalue = (uint32_t)0 - (uint32_t)value;

Should be uint32 not uint32_t again.

For anyone wondering, I suggested this to David in place of the ugly
special casing of INT32_MIN. This method avoids the UB of doing (-value)
where value==INT32_MIN, and is nevertheless required to produce the
correct result:

1. If value < 0, then ((uint32)value) is (value + UINT32_MAX + 1)
2. (uint32)0 - (uint32)value
  becomes  (UINT32_MAX+1)-(value+UINT32_MAX+1)
  which is (-value) as required
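
A standalone illustration (mine, not from the thread) of why this
negation is well-defined even for INT32_MIN:

	#include <stdint.h>
	#include <stdio.h>

	int
	main(void)
	{
		int32_t		value = INT32_MIN;	/* plain -value would be UB here */
		uint32_t	uvalue = (uint32_t) 0 - (uint32_t) value;

		printf("%u\n", (unsigned) uvalue);	/* prints 2147483648 */
		return 0;
	}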

 David> +int32
 David> +pg_lltoa_n(uint64_t value, char *a)

Again, if this is doing unsigned, then it should be named pg_ulltoa_n

 David> +   if (value == PG_INT32_MIN)

This being inconsistent with the others is not nice.

-- 
Andrew (irc:RhodiumToad)