Re: a misbehavior of partition row movement (?)

2021-02-19 Thread Masahiko Sawada
On Mon, Feb 15, 2021 at 10:37 PM Amit Langote  wrote:
>
> Thank you for looking at this.
>
> On Mon, Feb 15, 2021 at 4:12 PM Masahiko Sawada  wrote:
> > On Wed, Jan 20, 2021 at 7:04 PM Amit Langote  
> > wrote:
> > > This shows that the way we've made these triggers behave in general
> > > can cause some unintended behaviors for foreign keys during
> > > cross-partition updates.  I started this thread to do something about
> > > that and sent a patch to prevent cross-partition updates at all when
> > > there are foreign keys pointing to it.  As others pointed out, that's
> > > not a great long-term solution to the problem, but that's what we may
> > > have to do in the back-branches if anything at all.
> >
> > I've started by reviewing the patch for back-patching, that is, the first
> > patch you posted[1].
> >
> > +   for (i = 0; i < trigdesc->numtriggers; i++)
> > +   {
> > +   Trigger *trigger = &trigdesc->triggers[i];
> > +
> > +   if (trigger->tgisinternal &&
> > +   OidIsValid(trigger->tgconstrrelid) &&
> > +   trigger->tgfoid == F_RI_FKEY_CASCADE_DEL)
> > +   {
> > +   found = true;
> > +   break;
> > +   }
> > +   }
> >
> > IIUC the above checks if the partition table is referenced by a
> > foreign key constraint on another table with the ON DELETE CASCADE option.
> > I think we should prevent a cross-partition update also for ON DELETE
> > SET NULL and ON DELETE SET DEFAULT. For example, with the patch, a
> > tuple in the referencing table is still updated to NULL by a
> > cross-partition update, as follows:
> >
> > postgres=# create table p (a numeric primary key) partition by list (a);
> > CREATE TABLE
> > postgres=# create table p1 partition of p for values in (1);
> > CREATE TABLE
> > postgres=# create table p2 partition of p for values in (2);
> > CREATE TABLE
> > postgres=# insert into p values (1);
> > INSERT 0 1
> > postgres=# create table q (a int references p(a) on delete set null);
> > CREATE TABLE
> > postgres=# insert into q values (1);
> > INSERT 0 1
> > postgres=# update p set a = 2;
> > UPDATE 1
> > postgres=# table p;
> >  a
> > ---
> >  2
> > (1 row)
> >
> > postgres=# table q;
> >  a
> > ---
> >
> > (1 row)
>
> Yeah, I agree that's not good.
>
> Actually, I had meant to send an updated version of the patch to apply
> in back-branches that would prevent such a cross-partition update, but
> never did since starting to work on the patch for HEAD.  I have
> attached it here.

Thank you for updating the patch!

>
> Regarding the patch, I would have liked if it only prevented the
> update when the foreign key that would be violated by the component
> delete references the update query's main target relation.  If the
> foreign key references the source partition directly, then the delete
> would be harmless. However there's not enough infrastructure in v12,
> v13 branches to determine that, which makes back-patching this a bit
> hard.  For example, there's no way to tell in the back-branches if the
> foreign-key-enforcing triggers of a leaf partition have descended from
> the parent table.  IOW, I am not so sure anymore if we should try to
> do anything in the back-branches.

Hmm, I'm not sure about the necessity of figuring out whether the
foreign-key-enforcing triggers of a leaf partition have descended from
the parent table. I might be missing something, but even if the foreign
key references the leaf partition directly, the same problem could
happen with a cross-partition update, is that right?
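
For instance, here is a minimal sketch of the case I have in mind (the
table names are made up, and I haven't re-verified that a foreign key
can target the leaf's cloned unique index this way):

create table p (a numeric primary key) partition by list (a);
create table p1 partition of p for values in (1);
create table p2 partition of p for values in (2);
insert into p values (1);
create table q (a numeric references p1(a) on delete cascade);
insert into q values (1);
update p set a = 2;
-- the row is moved from p1 to p2, so the component delete on p1 would
-- presumably fire q's ON DELETE action, which is the behavior in question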

Also, the updated patch prevents a cross-partition update even when
the foreign key references another column of the same table. This kind of
cross-partition update currently works as expected, so it doesn't seem
necessary to disallow it. What do you think? The scenario is:

create table p (a int primary key, b int references p(a) on delete
cascade) partition by list (a);
create table p1 partition of p for values in (1);
create table p2 partition of p for values in (2);
insert into p values (1, 1);
update p set a = 2, b = 2 where a = 1;
ERROR:  cannot move row being updated to another partition
DETAIL:  Moving the row may cause a foreign key involving the source
partition to be violated.

Regards,

-- 
Masahiko Sawada
EDB:  https://www.enterprisedb.com/




Re: Problem with accessing TOAST data in stored procedures

2021-02-19 Thread Konstantin Knizhnik



On 19.02.2021 10:47, Pavel Stehule wrote:



On Fri, 19 Feb 2021 at 8:39, Konstantin Knizhnik <k.knizh...@postgrespro.ru> wrote:




On 19.02.2021 10:14, Pavel Stehule wrote:



On Fri, 19 Feb 2021 at 7:51, Konstantin Knizhnik <k.knizh...@postgrespro.ru> wrote:



On 18.02.2021 20:10, Pavel Stehule wrote:

This has a negative impact on performance - and a lot of
users use procedures without transaction control. So it
doesn't look like a good solution.

I am more concentrated on the Pg 14 release, where the work
with SPI is redesigned, and I hope so this issue is fixed
there. For older releases, I don't know. Is this issue
related to Postgres or it is related to PgPro only? If it is
related to community pg, then we should fix and we should
accept not too good performance, because there is no better
non invasive solution. If it is PgPro issue (because there
are ATX support) you can fix it (or you can try backport the
patch

https://git.postgresql.org/gitweb/?p=postgresql.git;a=commitdiff;h=ee895a655ce4341546facd6f23e3e8f2931b96bf
). You have more possibilities on PgPro code base.


Sorry, it is not a PgPro-specific problem; recent master
suffers from this bug as well.
In the original bug report there was a simple scenario for
reproducing the problem:

CREATE TABLE toasted(id serial primary key, data text);
INSERT INTO toasted(data) VALUES((SELECT
string_agg(random()::text,':') FROM generate_series(1, 1000)));
INSERT INTO toasted(data) VALUES((SELECT
string_agg(random()::text,':') FROM generate_series(1, 1000)));
DO $$ DECLARE v_r record; BEGIN FOR v_r in SELECT data FROM
toasted LOOP INSERT INTO toasted(data)
VALUES(v_r.data);COMMIT;END LOOP;END;$$;


can you use new procedure_resowner?


Sorry, I do not understand your suggestion.
How can procedure_resowner help to solve this problem?


This is just an idea - I think the most correct approach with zero performance
impact is keeping the snapshot, and this can be stored in the procedure_resowner.


The fundamental question is if we want or allow more snapshots per 
query. The implementation is a secondary issue.


I wonder if it is correct from a logical point of view.
If we commit a transaction in a stored procedure, then we actually
implicitly start a new transaction.
And a new transaction should have a new snapshot. Otherwise its behavior
will change.
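
To illustrate what I mean, here is a hypothetical sketch (table and
procedure names are made up): after the COMMIT inside the procedure, the
second query runs under a new snapshot, so it can see rows committed by
other sessions in the meantime.

create table t(a int);
create procedure demo() language plpgsql as $$
declare n bigint;
begin
  select count(*) into n from t;
  raise notice 'before commit: %', n;
  commit;                          -- implicitly starts a new transaction
  select count(*) into n from t;   -- evaluated under the new snapshot
  raise notice 'after commit: %', n;
end;
$$;
call demo();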


--
Konstantin Knizhnik
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company



Re: Problem with accessing TOAST data in stored procedures

2021-02-19 Thread Pavel Stehule
On Fri, 19 Feb 2021 at 9:08, Konstantin Knizhnik <k.knizh...@postgrespro.ru> wrote:

>
>
> On 19.02.2021 10:47, Pavel Stehule wrote:
>
>
>
> On Fri, 19 Feb 2021 at 8:39, Konstantin Knizhnik <k.knizh...@postgrespro.ru> wrote:
>
>>
>>
>> On 19.02.2021 10:14, Pavel Stehule wrote:
>>
>>
>>
>> On Fri, 19 Feb 2021 at 7:51, Konstantin Knizhnik <k.knizh...@postgrespro.ru> wrote:
>>
>>>
>>>
>>> On 18.02.2021 20:10, Pavel Stehule wrote:
>>>
>>> This has a negative impact on performance - and a lot of users use
>>> procedures without transaction control. So it doesn't look like a good
>>> solution.
>>>
>>> I am more concentrated on the Pg 14 release, where the work with SPI is
>>> redesigned, and I hope so this issue is fixed there. For older releases, I
>>> don't know. Is this issue related to Postgres or it is related to PgPro
>>> only? If it is related to community pg, then we should fix and we should
>>> accept not too good performance, because there is no better non invasive
>>> solution. If it is PgPro issue (because there are ATX support) you can fix
>>> it (or you can try backport the patch
>>> https://git.postgresql.org/gitweb/?p=postgresql.git;a=commitdiff;h=ee895a655ce4341546facd6f23e3e8f2931b96bf
>>> ). You have more possibilities on PgPro code base.
>>>
>>>
>>> Sorry, it is not PgPro specific problem and recent master suffers from
>>> this bug as well.
>>> In the original bug report there was simple scenario of reproducing the
>>> problem:
>>>
>>> CREATE TABLE toasted(id serial primary key, data text);
>>> INSERT INTO toasted(data) VALUES((SELECT string_agg(random()::text,':')
>>> FROM generate_series(1, 1000)));
>>> INSERT INTO toasted(data) VALUES((SELECT string_agg(random()::text,':')
>>> FROM generate_series(1, 1000)));
>>> DO $$ DECLARE v_r record; BEGIN FOR v_r in SELECT data FROM toasted LOOP
>>> INSERT INTO toasted(data) VALUES(v_r.data);COMMIT;END LOOP;END;$$;
>>>
>>
>> can you use new procedure_resowner?
>>
>> Sorry, I do not understand your suggestion.
>> How can procedure_resowner help to solve this problem?
>>
>
> This is just an idea - I think the most correct with zero performance
> impact is keeping snapshot, and this can be stored in procedure_resowner.
>
> The fundamental question is if we want or allow more snapshots per query.
> The implementation is a secondary issue.
>
>
> I wonder if it is correct from logical point of view.
> If we commit transaction in stored procedure, then we actually implicitly
> start new transaction.
> And new transaction should have new snapshot. Otherwise its behavior will
> change.
>

I have no problem with this. I have a problem with the loop implementation -
when I iterate over some result, that result should be consistent across
the whole iteration.  Otherwise, the behaviour is not deterministic.



> --
> Konstantin Knizhnik
> Postgres Professional: http://www.postgrespro.com
> The Russian Postgres Company
>
>


Re: ERROR: "ft1" is of the wrong type.

2021-02-19 Thread Kyotaro Horiguchi
At Thu, 18 Feb 2021 17:17:37 +0900 (JST), Kyotaro Horiguchi 
 wrote in 
> I can add some regression tests to cover all the live cases. That
> could reveal no-longer-used combinations.

The attached patch does that.

ATT_VIEW is used for "CREATE OR REPLACE VIEW" and is checked earlier,
in DefineVirtualRelation. But we can add a test to make sure that it is
checked anyway.

All other values can be exercised.

ATT_TABLE | ATT_MATVIEW
ATT_TABLE | ATT_MATVIEW | ATT_INDEX | ATT_PARTITIONED_INDEX
ATT_TABLE | ATT_MATVIEW | ATT_INDEX | ATT_PARTITIONED_INDEX |
 ATT_FOREIGN_TABLE
ATT_TABLE | ATT_MATVIEW | ATT_FOREIGN_TABLE
ATT_TABLE | ATT_MATVIEW | ATT_INDEX | ATT_FOREIGN_TABLE
ATT_TABLE | ATT_PARTITIONED_INDEX
ATT_TABLE | ATT_VIEW | ATT_MATVIEW | ATT_INDEX
ATT_TABLE | ATT_VIEW | ATT_FOREIGN_TABLE:
ATT_FOREIGN_TABLE

These are provoked by the following commands respectively:

  ALTER TABLE  CLUSTER ON
  ALTER TABLE  SET TABLESPACE
  ALTER TABLE  ALTER COLUMN  SET STATISTICS
  ALTER TABLE  ALTER COLUMN  SET STORAGE
  ALTER TABLE  ALTER COLUMN  SET()
  ALTER TABLE  ATTACH PARTITION
  ALTER TABLE/INDEX  SET/RESET
  ALTER TABLE  ALTER  SET DEFAULT
  ALTER TABLE/INDEX  ALTER COLLATION ..REFRESH VERSION
  ALTER TABLE  OPTIONS ()
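
For instance, the first combination (ATT_TABLE | ATT_MATVIEW) can be
provoked with a quick sketch like this (names made up, not taken from the
attached patch):

create view v1 as select 1 as a;
alter table v1 cluster on some_index;
-- expected to fail with: "v1" is not a table or materialized view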

The following three errors are already exercised.

ATT_TABLE
ATT_TABLE | ATT_FOREIGN_TABLE
ATT_TABLE | ATT_COMPOSITE_TYPE | ATT_FOREIGN_TABLE:


By the way, I find this somewhat mystifying. I'm not sure it's worth
fixing, though.

ALTER MATERIALIZED VIEW mv1 ALTER COLUMN a SET DEFAULT 1;
ERROR:  "mv1" is not a table, view, or foreign table

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center
diff --git a/src/test/regress/expected/alter_table.out b/src/test/regress/expected/alter_table.out
index 0ce6ee4622..4a367c9609 100644
--- a/src/test/regress/expected/alter_table.out
+++ b/src/test/regress/expected/alter_table.out
@@ -6,6 +6,7 @@ SET client_min_messages TO 'warning';
 DROP ROLE IF EXISTS regress_alter_table_user1;
 RESET client_min_messages;
 CREATE USER regress_alter_table_user1;
+CREATE VIEW at_v1 AS SELECT 1 as a;
 --
 -- add attribute
 --
@@ -120,6 +121,9 @@ ALTER INDEX attmp_idx ALTER COLUMN 4 SET STATISTICS 1000;
 ERROR:  column number 4 of relation "attmp_idx" does not exist
 ALTER INDEX attmp_idx ALTER COLUMN 2 SET STATISTICS -1;
 DROP TABLE attmp;
+-- test that the command correctly complains for the object of a wrong type
+ALTER TABLE at_v1 ALTER COLUMN a SET STATISTICS 0; -- ERROR
+ERROR:  "at_v1" is not a table, materialized view, index, partitioned index, or foreign table
 --
 -- rename - check on both non-temp and temp tables
 --
@@ -1186,6 +1190,11 @@ select * from def_test;
 alter table def_test alter column c1 set default 'wrong_datatype';
 ERROR:  invalid input syntax for type integer: "wrong_datatype"
 alter table def_test alter column c2 set default 20;
+-- set defaults to an incorrect object: this should fail
+create materialized view def_tmp_mv as select 1 as a;
+alter table def_tmp_mv alter a set default 0;
+ERROR:  "def_tmp_mv" is not a table, view, or foreign table
+drop materialized view def_tmp_mv;
 -- set defaults on a non-existent column: this should fail
 alter table def_test alter column c3 set default 30;
 ERROR:  column "c3" of relation "def_test" does not exist
@@ -2076,6 +2085,9 @@ Indexes:
 "at_part_2_a_idx" btree (a)
 "at_part_2_b_idx" btree (b)
 
+-- check if the command correctly complains for the object of a wrong type 
+alter table at_partitioned_a_idx set (dummy = 1); -- ERROR
+ERROR:  "at_partitioned_a_idx" is not a table, view, materialized view, or index
 drop table at_partitioned;
 -- Alter column type when no table rewrite is required
 -- Also check that comments are preserved
@@ -2212,6 +2224,9 @@ Indexes:
  a  | text| yes  | a  | external | 
 btree, for table "public.test_storage"
 
+-- test that SET STORAGE correctly complains for the object of a wrong type
+alter table at_v1 alter column a set storage plain; -- ERROR
+ERROR:  "at_v1" is not a table, materialized view, or foreign table
 -- ALTER COLUMN TYPE with a check constraint and a child table (bug #13779)
 CREATE TABLE test_inh_check (a float check (a > 10.2), b float);
 CREATE TABLE test_inh_check_child() INHERITS(test_inh_check);
@@ -2684,6 +2699,9 @@ select * from my_locks order by 1;
 (2 rows)
 
 commit;
+-- test that the command correctly complains for the object of a wrong type
+alter table at_v1 alter column a set (dummy = 1);
+ERROR:  "at_v1" is not a table, materialized view, index, or foreign table
 begin; alter table alterlock alter column f2 set storage extended;
 select * from my_locks order by 1;
   relname  |max_lockmode 
@@ -4110,6 +4128,9 @@ ERROR:  remainder for hash partition must be less than modulus
 ALTER TABLE hash_parted ATTACH PARTITION fail_part FOR VALUES WITH (MODULUS 3, REMAINDER 2);
 ERROR:  every hash partition modulus must be a factor of the next larger modulus
 DROP TABLE fail_part;
+-- check that attach partition co

Re: [PoC] Non-volatile WAL buffer

2021-02-19 Thread Konstantin Knizhnik

Thank you for your feedback.

On 19.02.2021 6:25, Tomas Vondra wrote:

On 1/22/21 5:04 PM, Konstantin Knizhnik wrote:

...

I have heard from several DBMS experts that the appearance of huge and
cheap non-volatile memory can cause a revolution in database system
architecture. If the whole database can fit in non-volatile memory, then we
do not need buffers, WAL, ...
But although multi-terabyte NVM announcements were made by IBM several
years ago, I do not know of any successful DBMS prototypes with such a new
architecture.

I tried to understand why...


IMHO those predictions are a bit too optimistic, because they often
assume PMEM behavior is mostly similar to DRAM, except for the extra
persistence. But that's not quite true - throughput with PMEM is much
lower

Actually, that is not completely true.
There are several types of NVDIMMs.
The most popular now is NVDIMM-N, which is just a combination of DRAM and flash.
Its speed is the same as that of normal DRAM, but the size of such memory is
also only comparable with DRAM.

So I do not think that it is a promising approach.
And the speed of Intel Optane memory is definitely much slower than that of DRAM.

But the main advantage of PMEM from my point of view is that it allows
to avoid write-ahead logging at all!

No, PMEM certainly does not allow avoiding write-ahead logging - we
still need to handle e.g. recovery after a crash, when the data files
are in unknown / corrupted state.


It is possible to avoid write-ahead logging if we use special algorithms
(like PMwCAS) which ensure atomicity of updates.

The problem with removing buffer manager and just writing everything
directly to PMEM is the worse latency/throughput (compared to DRAM).
It's probably much more efficient to combine multiple writes into RAM
and then do one (much slower) write to persistent storage, than pay the
higher latency for every write.

It might make sense for data sets that are larger than DRAM but can fit
into PMEM. But that seems like fairly rare case, and even then it may be
more efficient to redesign the schema to fit into RAM somehow (sharding,
partitioning, ...).


Certainly, avoiding buffering will make sense only if the speed of accessing
PMEM is comparable with DRAM.

So I have to patch code of this library instead of just using it:

   g...@github.com:postgrespro/bztree.git

I have not yet tested the most interesting case: access to PMEM through PMDK.
And I do not have hardware for such tests.
But the first results also seem to be interesting: PMwCAS is a kind of
lockless algorithm and it shows much better scaling on a
NUMA host compared with standard Postgres.

I have done simple parallel insertion test: multiple clients are
inserting data with random keys.
To make competition with vanilla Postgres more honest I used unlogged
table:

create unlogged table t(pk int, payload int);
create index on t using bztree(pk);

randinsert.sql:
insert into t (payload,pk) values
(generate_series(1,1000),random()*10);

pgbench -f randinsert.sql -c N -j N -M prepared -n -t 1000 -P 1 postgres

So each client is inserting one million records.
The target system has 160 virtual and 80 real cores with 256GB of RAM.
Results (TPS) are the following:

N      nbtree    bztree
1         540       455
10        993      2237
100      1479      5025

So bztree is more than 3 times faster for 100 clients.
Just for comparison: result for inserting in this table without index is
10k TPS.

I'm not familiar with bztree, but I agree novel indexing structures are
an interesting topic on their own. I only quickly skimmed the bztree
paper, but it seems it might be useful even on DRAM (assuming it will
work with replication etc.).

The other "problem" with placing data files (tables, indexes) on PMEM
and making this code PMEM-aware is that these writes generally happen
asynchronously in the background, so the impact on transaction rate is
fairly low. This is why all the patches in this thread try to apply PMEM
on the WAL logging / flushing, which is on the critical path.


I want to give an update on my prototype.
Unfortunately my attempt to use bztree with PMEM failed
because of two problems:

1. The libpmemobj/bztree libraries used are not compatible with the
Postgres architecture.
They support concurrent access, but only by multiple threads within one
process (they make heavy use of thread-local variables).
The traditional Postgres approach (initialize shared data structures in
the postmaster (shared_preload_libraries) and inherit them in forked
child processes) doesn't work for libpmemobj.

If a child doesn't open the pmem itself, then any access to it causes a crash.
And if the pmem is opened by the child, it is mapped at a different
virtual memory address.
But the bztree and pmwcas implementations expect the addresses to be the
same in all clients.


2. There is some bug in the bztree/pmwcas implementation which causes its
own test to hang in the case of multithreaded access in persistence mode.
I tried to find the reason for the problem but didn't succeed yet: the
PMwCAS implementation is very

Re: Extensibility of the PostgreSQL wire protocol

2021-02-19 Thread Kuntal Ghosh
On Thu, Feb 18, 2021 at 9:32 PM Jan Wieck  wrote:
>
And, here is how it looks with the following configuration:
telnet_srv.port = 1433
telnet_srv.listen_addresses = '*'

telnet localhost 1433


   master
Trying 127.0.0.1...
Connected to localhost.
Escape character is '^]'.
PostgreSQL Telnet Interface

database name: postgres
username: kuntal
password: changeme
> select 1;
?column?

1

SELECT 1
> select 1/0;
Message: ERROR - division by zero

A few comments on the extension code (although it is experimental):

1. In telnet_srv.c,
+ static intpe_port;
..
+   DefineCustomIntVariable("telnet_srv.port",
+   "Telnet server port.",
+   NULL,
+   &pe_port,
+   pe_port,
+   1024,
+   65536,
+   PGC_POSTMASTER,
+   0,
+   NULL,
+   NULL,
+   NULL);

The variable pe_port should be initialized to a value which is > 1024
and < 65536. Otherwise, the following assert will fail,
TRAP: FailedAssertion("newval >= conf->min", File: "guc.c", Line:
5541, PID: 12100)

2. The function pq_putbytes shouldn't be used by anyone other than
old-style COPY out.
+   pq_putbytes(msg, strlen(msg));
Otherwise, the following assert will fail in the same function:
/* Should only be called by old-style COPY OUT */
Assert(DoingCopyOut);

-- 
Thanks & Regards,
Kuntal Ghosh
Amazon Web Services




Re: [WIP] UNNEST(REFCURSOR): allowing SELECT to consume data from a REFCURSOR

2021-02-19 Thread Dent John
Hi Massimo,

Happy to help. And actually, end user (i.e., developer) feedback on the 
feature’s usefulness is probably one of the more important contributions.

d.

> On 10 Feb 2021, at 08:57, Massimo Fidanza  wrote:
> 
> Hi John,
> 
> I never build postgresql from source, so I must get some information on how 
> to apply your patch and do some test. I can't review your code because I know 
> nothing about Postgresql internals and just basic C. I am mainly a PL/SQL 
> programmer, with experience with PHP, Python and Javascript. If I can give 
> some contribution I will be happy, but I need some help.
> 
> Massimo
> 
> On Sun, 7 Feb 2021 at 22:35, Dent John wrote:
> Hi Massimo,
> 
> Thanks for the interest, and my apologies for the late reply.
> 
> I’m not particularly abandoning it, but I don’t have particular reason to 
> make further changes at the moment. Far as I’m concerned it works, and the 
> main question is whether it is acceptable and useful.
> 
> I’d be happy if you have feedback that evolves it or might push it up the 
> queue for commitfest review.
> 
> d.
> 
>> On 18 Jan 2021, at 23:09, Massimo Fidanza wrote:
>> 
>> This is an interesting feature, but it seems that the author has abandoned 
>> development, what happens now? Will this be postponed from commitfest to 
>> commitfest and never be taken over by anyone?
>> 
>> Massimo.
>> 
>> On Fri, 6 Mar 2020 at 22:36, Dent John wrote:
>> > On 22 Feb 2020, at 10:38, Dent John <de...@qqdd.eu> wrote:
>> > 
>> >> On 18 Feb 2020, at 03:03, Thomas Munro wrote:
>> >> 
>> >> From the trivialities department, I see a bunch of warnings about
>> >> local declaration placement (we're still using C90 rules for those by
>> >> project policy):
>> >> 
>> >> […]
>> > 
>> > […]
>> 
>> My bad. I missed one declaration.
>> 
>> Another patch attached.
>> 
>> d.
> 



Re: Extend more usecase for planning time partition pruning and init partition pruning.

2021-02-19 Thread Andy Fan
On Mon, Feb 8, 2021 at 3:43 PM Andy Fan  wrote:

>
>
> On Mon, Jan 25, 2021 at 10:21 AM Andy Fan 
> wrote:
>
>>
>>
>> On Sun, Jan 24, 2021 at 6:34 PM Andy Fan 
>> wrote:
>>
>>> Hi:
>>>
>>>  I recently found a use case like this:  SELECT * FROM p, q WHERE
>>> p.partkey =
>>>  q.colx AND (q.colx = $1 OR q.colx = $2);  Here we can do neither
>>> planning-time
>>>  partition pruning nor init partition pruning.  Even though run-time
>>>  partition pruning works in the end, it is too late in some cases since we
>>> have
>>>  to init all the plan nodes in advance.  In my case, there are 10+
>>>  partitioned relations in one query and the execution time is short, so
>>> the
>>>  cost of initializing a lot of plan nodes matters a lot.
>>>
>>> The attached patches fix this issue. They just pick up the "p.partkey = q.colx"
>>> case from root->eq_classes or rel->joinlist (outer join), and then check
>>> whether there
>>> is some baserestrictinfo in the other relation which can be used for
>>> partition
>>> pruning. To make things easier, both partkey and colx must be Var
>>> expressions in the implementation.
>>>
>>> - v1-0001-Make-some-static-functions-as-extern-and-extend-C.patch
>>>
>>> Just some existing refactoring and extending ChangeVarNodes to be able
>>> to change var->attno.
>>>
>>> - v1-0002-Build-some-implied-pruning-quals-to-extend-the-us.patch
>>>
>>> Do the real job.
>>>
>>> Thought?
>>>
>>>
>>>
>>> --
>>> Best Regards
>>> Andy Fan (https://www.aliyun.com/)
>>>
>>
>>
>> Some results from this patch.
>>
>> create table p (a int, b int, c character varying(8)) partition by
>> list(c);
>> create table p1  partition of p for values in ('01');
>> create table p2  partition of p for values in ('02');
>> create table p3  partition of p for values in ('03');
>> create table q (a int, c character varying(8), b int) partition by
>> list(c);
>> create table q1  partition of q for values in ('01');
>> create table q2  partition of q for values in ('02');
>> create table q3  partition of q for values in ('03');
>>
>> Before the patch:
>> postgres=# explain (costs off) select * from p inner join q on p.c = q.c
>> and q.c > '02';
>>  QUERY PLAN
>> 
>>  Hash Join
>>Hash Cond: ((p.c)::text = (q.c)::text)
>>->  Append
>>  ->  Seq Scan on p1 p_1
>>  ->  Seq Scan on p2 p_2
>>  ->  Seq Scan on p3 p_3
>>->  Hash
>>  ->  Seq Scan on q3 q
>>Filter: ((c)::text > '02'::text)
>> (9 rows)
>>
>> After the patch:
>>
>>  QUERY PLAN
>> 
>>  Hash Join
>>Hash Cond: ((p.c)::text = (q.c)::text)
>>->  Seq Scan on p3 p
>>->  Hash
>>  ->  Seq Scan on q3 q
>>Filter: ((c)::text > '02'::text)
>> (6 rows)
>>
>>
>> Before the patch:
>> postgres=# explain (costs off) select * from p inner join q on p.c = q.c
>> and (q.c = '02' or q.c = '01');
>>  QUERY PLAN
>>
>> 
>>  Hash Join
>>Hash Cond: ((p.c)::text = (q.c)::text)
>>->  Append
>>  ->  Seq Scan on p1 p_1
>>  ->  Seq Scan on p2 p_2
>>  ->  Seq Scan on p3 p_3
>>->  Hash
>>  ->  Append
>>->  Seq Scan on q1 q_1
>>  Filter: (((c)::text = '02'::text) OR ((c)::text
>> = '01'::text))
>>->  Seq Scan on q2 q_2
>>  Filter: (((c)::text = '02'::text) OR ((c)::text
>> = '01'::text))
>> (12 rows)
>>
>> After the patch:
>>  QUERY PLAN
>>
>> 
>>  Hash Join
>>Hash Cond: ((p.c)::text = (q.c)::text)
>>->  Append
>>  ->  Seq Scan on p1 p_1
>>  ->  Seq Scan on p2 p_2
>>->  Hash
>>  ->  Append
>>->  Seq Scan on q1 q_1
>>  Filter: (((c)::text = '02'::text) OR ((c)::text
>> = '01'::text))
>>->  Seq Scan on q2 q_2
>>  Filter: (((c)::text = '02'::text) OR ((c)::text
>> = '01'::text))
>> (11 rows)
>>
>> Before the patch:
>> postgres=# explain (costs off) select * from p left join q on p.c = q.c
>> where (q.c = '02' or q.c = '01');
>>  QUERY PLAN
>>
>> 
>>  Hash Join
>>Hash Cond: ((p.c)::text = (q.c)::text)
>>->  Append
>>  ->  Seq Scan on p1 p_1
>>  ->  Seq Scan on p2 p_2
>>  ->  Seq Scan on p3 p_3
>>->  Hash
>>  ->  Append
>>->  Seq Scan on q1 q_1
>>  Filter: (((c)::text = '02'::text) OR ((c)::text
>> = '01'::text))
>>->  S

Re: Parallel INSERT (INTO ... SELECT ...)

2021-02-19 Thread Amit Kapila
On Thu, Feb 18, 2021 at 11:05 AM Amit Langote  wrote:
>
> > > It also occurred to me that we can avoid pointless adding of
> > > partitions if the final plan won't use parallelism.  For that, the
> > > patch adds checking glob->parallelModeNeeded, which seems to do the
> > > trick though I don't know if that's the correct way of doing that.
> > >
> >
> > I'm not sure if's pointless adding partitions even in the case of NOT
> > using parallelism, because we may be relying on the result of
> > parallel-safety checks on partitions in both cases.
> > For example, insert_parallel.sql currently includes a test (that you
> > originally provided in a previous post) that checks a non-parallel
> > plan is generated after a parallel-unsafe trigger is created on a
> > partition involved in the INSERT.
> > If I further add to that test by then dropping that trigger and then
> > again using EXPLAIN to see what plan is generated, then I'd expect a
> > parallel-plan to be generated, but with the setrefs-v3.patch it still
> > generates a non-parallel plan. So I think the "&&
> > glob->parallelModeNeeded" part of test needs to be removed.
>
> Ah, okay, I didn't retest my case after making that change.
>

Greg has a point here, but I feel something along the previous lines (having a
test of glob->parallelModeNeeded) is better. We only want to
invalidate the plan if the prepared plan is unsafe to execute next
time. It is quite possible that there are unsafe triggers on different
partitions and only one of them is dropped, so the next planning will
again yield the same non-parallel plan. If we agree with that, I
think it is better to add this dependency in set_plan_refs (along with
the Gather node handling).
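
For example, here is a rough sketch of the scenario I mean (all names
are made up): two partitions each have a parallel-unsafe trigger, and
dropping only one of them still leaves the insert non-parallel.

CREATE TABLE pt (a int) PARTITION BY LIST (a);
CREATE TABLE pt1 PARTITION OF pt FOR VALUES IN (1);
CREATE TABLE pt2 PARTITION OF pt FOR VALUES IN (2);
CREATE FUNCTION unsafe_trig() RETURNS trigger LANGUAGE plpgsql
  PARALLEL UNSAFE AS $$ BEGIN RETURN NEW; END; $$;
CREATE TRIGGER t1 BEFORE INSERT ON pt1
  FOR EACH ROW EXECUTE FUNCTION unsafe_trig();
CREATE TRIGGER t2 BEFORE INSERT ON pt2
  FOR EACH ROW EXECUTE FUNCTION unsafe_trig();
DROP TRIGGER t1 ON pt1;  -- pt2's trigger still forces a non-parallel plan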

Also, if we agree that we don't have any cheap way to determine
parallel-safety of partitioned relations then shall we consider the
patch being discussed [1] to be combined here?

Amit L, do you agree with that line of thought, or you have any other ideas?

I feel we should focus on getting Greg's first patch
(v18-0001-Enable-parallel-SELECT-for-INSERT-INTO-.-SELECT, along with
a test case patch) and Hou-San's patch discussed at [1] ready. Then we
can focus on
v18-0003-Enable-parallel-INSERT-and-or-SELECT-for-INSERT-INTO, because
even the first patch alone would be good enough for some users.

What do you think?

[1] - 
https://www.postgresql.org/message-id/CAA4eK1K-cW7svLC2D7DHoGHxdAdg3P37BLgebqBOC2ZLc9a6QQ%40mail.gmail.com

-- 
With Regards,
Amit Kapila.




Re: Parallel INSERT (INTO ... SELECT ...)

2021-02-19 Thread Amit Kapila
On Fri, Feb 19, 2021 at 10:13 AM tsunakawa.ta...@fujitsu.com
 wrote:
>
> From: Greg Nancarrow 
> --
> On Mon, Jan 25, 2021 at 10:23 AM tsunakawa.ta...@fujitsu.com 
>  wrote:
> > (8)
> > +   /*
> > +* If the trigger type is RI_TRIGGER_FK, this indicates a 
> > FK exists in
> > +* the relation, and this would result in creation of new 
> > CommandIds
> > +* on insert/update/delete and this isn't supported in a 
> > parallel
> > +* worker (but is safe in the parallel leader).
> > +*/
> > +   trigtype = RI_FKey_trigger_type(trigger->tgfoid);
> > +   if (trigtype == RI_TRIGGER_FK)
> > +   {
> > +   if 
> > (max_parallel_hazard_test(PROPARALLEL_RESTRICTED, context))
> > +   return true;
> > +   }
> >
> > Here, RI_TRIGGER_FK should instead be RI_TRIGGER_PK, because RI_TRIGGER_FK 
> > triggers do not generate command IDs.  See RI_FKey_check() which is called 
> > in RI_TRIGGER_FK case.  In there, ri_PerformCheck() is called with the 
> > detectNewRows argument set to false, which causes CommandCounterIncrement() 
> > to not be called.
> >
>
> Hmmm, I'm not sure that you have read and interpreted the patch code 
> correctly.
> The existence of a RI_TRIGGER_FK trigger indicates the table has a foreign 
> key, and an insert into such a table will generate a new commandId (so we 
> must avoid that, as we don't currently have the technology to support sharing 
> of new command IDs across the participants in the parallel operation). This 
> is what the code comment says; it does not say that such a trigger generates 
> a new command ID.
>
> See Amit's updated comment here: 
> https://github.com/postgres/postgres/commit/0d32511eca5aec205cb6b609638ea67129ef6665
>
> In addition, the 2nd patch has an explicit test case for this (testing insert 
> into a table that has a FK).
> --
>
>
> First of all, I anticipate this parallel INSERT SELECT feature will typically 
> shine, and is expected to work, in ETL or ELT into a data warehouse or an 
> ODS for analytics.  Bearing that in mind, let me list some issues or 
> questions below.  But the current state of the patch would of course be 
> attractive in some workloads, so these are not necessarily 
> blockers.
>
>
> (1)
> According to the classic book "The Data Warehouse Toolkit" and the website 
> [1] by its author, the fact table (large transaction history) in the data 
> warehouse has foreign keys referencing to the dimension tables (small or 
> medium-sized master or reference data).  So, parallel insert will be 
> effective if it works when loading data into the fact table with foreign keys.
>
> To answer the above question, I'm assuming:
>
> CREATE TABLE some_dimension (key_col int PRIMARY KEY);
> CREATE TABLE some_fact (some_key int REFERENCES some_dimension);
> INSERT INTO some_fact SELECT ...;
>
>
> My naive question is, "why should new command IDs be generated to check 
> foreign key constraints in this INSERT case?  The check just reads the parent 
> (some_dimension table here)..."
>

It is quite possible that what you are saying is correct, but I feel that is
not this patch's fault. So, wouldn't it be better to discuss this in a
separate thread?

>
>
> (2)
> Likewise, dimension tables have surrogate keys that are typically implemented 
> as a sequence or an identity column.  It is suggested that even fact tables 
> sometimes (or often?) have surrogate keys.  But the current patch does not 
> parallelize the statement when the target table has a sequence or an identity 
> column.
>
> I was looking at the sequence code, and my naive (again) idea is that the 
> parallel leader and workers allocates numbers from the sequence 
> independently, and sets the largest number of them as the session's currval 
> at the end of parallel operation.  We have to note in the documentation that 
> gaps in the sequence numbers will arise and not used in parallel DML.
>

Good use case but again, I think this can be done as a separate patch.

>
> (3)
> As Hou-san demonstrated, the current patch causes the resulting table and 
> index to become larger when inserted in parallel than in inserted serially.  
> This could be a problem for analytics use cases where the table is just 
> inserted and read only afterwards.  We could advise the user to run REINDEX 
> CONCURRENTLY after loading data, but what about tables?
>

I think here you are talking about the third patch (Parallel Inserts).
I guess if one had run the inserts in parallel from psql, similar
behavior would have been observed. For tables, it might lead to better
results once we have the patch discussed at [1]. Actually, this needs
more investigation.

[1] - 
https://www.postgresql.org/message-id/20200508072545.GA9701%40telsasoft.com

-- 
With Re

RE: libpq debug log

2021-02-19 Thread iwata....@fujitsu.com
Hi all,

I have updated the patch to v18. It has been fixed in response to Tsunakawa-san's review.

> From: Tsunakawa, Takayuki/綱川 貴之 
> Sent: Friday, February 12, 2021 1:53 PM
> 
> (48)
>  
>  void PQtrace(PGconn *conn, FILE *stream);
>  
>   
> 
> + 
> +  Calls PQtraceSetFlags to output with or
> without a timestamp.
> + 
> +
>   
> 
> Why is this necessary?  Even if you decide to remove this change, can you
> share your idea on why you added this just in case?

In the previous patch, we described the PQTRACE_SUPPRESS_TIMESTAMPS flag in the
PQtrace() section.
I think PQtrace() and PQtraceSetFlag() are closely related functions,
and it would be nice for users to know about both features before using the
logging feature.
So I had added a description of PQtraceSetFlag() to the paragraph about PQtrace().
However, the PQtraceSetFlag() documentation is in the paragraph right after
PQtrace(),
so users can easily notice the PQtraceSetFlag() function anyway. So I removed it.

> (49)
> +  Determine to output tracing with or without a timestamp to a
> debugging file stream.
> 
> The description should be the overall functionality of this function 
> considering
> the future additions of flags.  Something like:
> 
> Controls the tracing behavior of client/server communication.
> 
> +  then timestamp is not printed with each message. The default is
> output with timestamp.
> 
> The default value for flags argument (0) should be described here.
> 
> Also, it should be added that PQsetTraceFlags() can be called before or after
> the PQtrace() call.

I fixed the documentation like this:
 If set to 0 (default), tracing will be output with a timestamp.
 This function should be called after calling PQtrace.

> (50)
> I'd like you to consider skipLogging again, because it seems that such a
> variable is for the logging machinery to control state transitions internally 
> in
> fe-logging.c.
> What I'm concerned is that those who modify libpq have to be aware of
> skipLogging and would fail to handle it.

To implement this with inLogging, the logging-skip logic has been modified so
that it is handled entirely within the libpq-logging.c file.
A protocol message consists of a first byte, a length, and the contents.
When the first byte and length arrive but the contents are not yet in the
buffer, the process of reading the first byte and length is repeated.
This re-reading is necessary to parse the protocol message and proceed,
but we want to skip logging for it, because there is no need to output a
message which has already been logged to the file.

To do this, we remember the value of inCursor in the inLogging variable, which
records how far logging has progressed.
In pqTraceLogBeMsgByte1() and pqTraceLogBeMsgInt(),
the value of inCursor at that time is stored in inLogging after outputting the
log to the file.
If inCursor is smaller than inLogging, the log output function returns without
doing anything.
Once all log messages have been output, inLogging is reset.

I changed the pqTraceMaybeBreakLine() function to return whether a line break
has occurred, in order to consolidate the cursor-recording logic in one place.
If it breaks the line, inLogging is reset and inCursor is not remembered in
inLogging.

Also, as Horiguchi-san mentioned, I changed the code to update the value of
inLogging in the functions where the value of inCursor is changed as well
(e.g. pqCheckInBufferSpace() and pqReadData()).


> (51)
> >> +typedef enum PGLogState
> >> +{
> >>
> >> This is libpq-logging.c internal type. It is not needed to be exposed.
> 
> > I fixed it.
> 
> How did you fix it?  The typedef is still in .h file.  It should be in .h, 
> shouldn't
> it?  Houw about the other typedefs?

I didn't move PGLogState to the .c file because PGLogState is a member of
be_msg, which is referred to from other files.


Regards,
Aya Iwata
Fujitsu


v18-0001-libpq-trace.patch
Description: v18-0001-libpq-trace.patch


Re: Extensibility of the PostgreSQL wire protocol

2021-02-19 Thread Damir Simunic


> On 11 Feb 2021, at 16:06, Tom Lane  wrote:
> 
> Maybe there is some useful thing that can be accomplished here, but we
> need to consider the bigger picture rather than believing (without proof)
> that a few hook variables will be enough to do anything.
> 
>   regards, tom lane
> 

A pluggable wire protocol is a game-changer on its own.

The bigger picture is that a right protocol choice enables large-scale 
architectural simplifications for whole classes of production applications.

For browser-based applications (lob, saas, e-commerce), having the database 
server speak the browser protocol enables architectures without backend 
application code. This in turn leads to significant reductions of latency, 
complexity, and application development time. And it’s not just lack of backend 
code: one also profits from all the existing infrastructure like per-query 
compression/format choice, browser connection management, sse, multiple 
streams, prioritization, caching/cdns, etc.

Don’t know if you’d consider it as a proof, yet I am seeing 2x to 4x latency 
reduction in production applications from protocol conversion to http/2. My 
present solution is a simple connection pooler I built on top of Nginx 
transforming the tcp stream as it passes through.

In a recent case, letting the browser talk directly to the database allowed me 
to get rid of a ~100k-sloc .net backend and all the complexity and 
infrastructure that goes with coding/testing/deploying/maintaining it, while 
keeping all the positives: per-query compression/data conversion, querying 
multiple databases over a single connection, session cookies, etc. Deployment 
is trivial compared to what was before. Latency is down 2x-4x across the board.

Having some production experience with this approach, I can see how 
http/2-speaking Postgres would further reduce latency, processing cost, and 
time-to-interaction for applications.

A similar case can be made for IoT where one would want to plug an 
iot-optimized protocol. Again, most of the benefit is possible with a 
protocol-converting proxy, but there are additional non-trivial performance 
gains to be had if the database server speaks the right protocol.

While not the only use cases, I’d venture a guess these represent a sizable 
chunk of what Postgres is used for today, and will be used even more for, so 
the positive impact of a pluggable protocol would be significant.

--
Damir



[PATCH] Present all committed transaction to the output plugin

2021-02-19 Thread Markus Wanner

Hi,

attached is a patch that I think cleans up the API between Postgres
and the logical decoding plugin.  Up until now, not only rolled-back
transactions but also some committed transactions were filtered out and not
presented to the output plugin.  While it is documented that aborted
transactions are not decoded, the second exception has not been documented.


The difference is with committed empty transactions that have a snapshot 
versus those that do not.  I think that's arbitrary and propose to 
remove this distinction, so that all committed transactions are decoded.


In the case of decoding a two-phase transaction, I argue that this is 
even more important, as the gid potentially carries information.


Please consider the attached patch, which drops the mentioned filter. 
It also adjusts tests to show the difference and provides a minor 
clarification to the documentation.


Regards

Markus
From: Markus Wanner 
Date: Fri, 19 Feb 2021 13:34:22 +0100
Subject: Present all committed transactions to the output plugin

Drop the filtering of some empty transactions and more
consistently present all committed transactions to the output
plugin.  Changes behavior for empty transactions that do not even
have a base_snapshot.

While clearly not making any substantial changes to the database,
an empty transaction still increments the local TransactionId
and possibly carries a gid (in the case of two-phase
transactions).

Adjust the test_decoder to show the difference between empty
transactions that do not carry a snapshot and those that
do. Adjust tests to cover decoding of committed empty
transactions with and without snapshots.

Add a minor clarification in the docs.
---
 .../replication/logical/reorderbuffer.c   | 41 ---
 doc/src/sgml/logicaldecoding.sgml | 10 +++--
 contrib/test_decoding/expected/mxact.out  |  2 +-
 .../test_decoding/expected/ondisk_startup.out |  2 +-
 contrib/test_decoding/expected/prepared.out   | 33 +--
 contrib/test_decoding/expected/xact.out   | 39 ++
 contrib/test_decoding/sql/prepared.sql| 23 +--
 contrib/test_decoding/sql/xact.sql| 21 ++
 contrib/test_decoding/test_decoding.c | 12 ++
 9 files changed, 147 insertions(+), 36 deletions(-)

diff --git a/src/backend/replication/logical/reorderbuffer.c b/src/backend/replication/logical/reorderbuffer.c
index 5a62ab8bbc1..1e9ed81c3c1 100644
--- a/src/backend/replication/logical/reorderbuffer.c
+++ b/src/backend/replication/logical/reorderbuffer.c
@@ -2016,7 +2016,8 @@ ReorderBufferProcessTXN(ReorderBuffer *rb, ReorderBufferTXN *txn,
 	ReorderBufferBuildTupleCidHash(rb, txn);
 
 	/* setup the initial snapshot */
-	SetupHistoricSnapshot(snapshot_now, txn->tuplecid_hash);
+	if (snapshot_now)
+		SetupHistoricSnapshot(snapshot_now, txn->tuplecid_hash);
 
 	/*
 	 * Decoding needs access to syscaches et al., which in turn use
@@ -2289,6 +2290,8 @@ ReorderBufferProcessTXN(ReorderBuffer *rb, ReorderBufferTXN *txn,
 	break;
 
 case REORDER_BUFFER_CHANGE_INTERNAL_SNAPSHOT:
+	Assert(snapshot_now);
+
 	/* get rid of the old */
 	TeardownHistoricSnapshot(false);
 
@@ -2321,6 +2324,7 @@ ReorderBufferProcessTXN(ReorderBuffer *rb, ReorderBufferTXN *txn,
 	break;
 
 case REORDER_BUFFER_CHANGE_INTERNAL_COMMAND_ID:
+	Assert(snapshot_now);
 	Assert(change->data.command_id != InvalidCommandId);
 
 	if (command_id < change->data.command_id)
@@ -2397,8 +2401,11 @@ ReorderBufferProcessTXN(ReorderBuffer *rb, ReorderBufferTXN *txn,
 		 * streaming mode.
 		 */
 		if (streaming)
+		{
+			Assert(snapshot_now);
 			ReorderBufferSaveTXNSnapshot(rb, txn, snapshot_now, command_id);
-		else if (snapshot_now->copied)
+		}
+		else if (snapshot_now && snapshot_now->copied)
 			ReorderBufferFreeSnap(rb, snapshot_now);
 
 		/* cleanup */
@@ -2520,7 +2527,6 @@ ReorderBufferReplay(ReorderBufferTXN *txn,
 	TimestampTz commit_time,
 	RepOriginId origin_id, XLogRecPtr origin_lsn)
 {
-	Snapshot	snapshot_now;
 	CommandId	command_id = FirstCommandId;
 
 	txn->final_lsn = commit_lsn;
@@ -2544,28 +2550,15 @@ ReorderBufferReplay(ReorderBufferTXN *txn,
 	}
 
 	/*
-	 * If this transaction has no snapshot, it didn't make any changes to the
-	 * database, so there's nothing to decode.  Note that
-	 * ReorderBufferCommitChild will have transferred any snapshots from
-	 * subtransactions if there were any.
+	 * Process and send the changes to output plugin.
+	 *
+	 * Note that for empty transactions, txn->base_snapshot may well be
+	 * NULL. The corresponding callbacks will still be invoked, as even an
+	 * empty transaction carries information (LSN increments, the gid in
+	 * case of a two-phase transaction).  This is unlike versions prior to
+	 * 14 which optimized away transactions without a snapshot.
 	 */
-	if (txn->base_snapshot == NULL)
-	{
-		Assert(txn->ninvalidations == 0);
-
-		/*
-		 * Removing this txn before a c

Re: Some regular-expression performance hacking

2021-02-19 Thread Joel Jacobson
On Thu, Feb 18, 2021, at 19:53, Tom Lane wrote:
>(Having said that, I can't help noticing that a very large fraction
>of those usages look like, eg, "[\w\W]".  It seems to me that that's
>a very expensive and unwieldy way to spell ".".  Am I missing
>something about what that does in Javascript?)

This popular regex

^(?:\s*(<[\w\W]+>)[^>]*|#([\w-]+))$

is coming from jQuery:

// A simple way to check for HTML strings
// Prioritize #id over  to avoid XSS via location.hash (#9521)
// Strict HTML recognition (#11290: must start with <)
// Shortcut simple #id case for speed
rquickExpr = /^(?:\s*(<[\w\W]+>)[^>]*|#([\w-]+))$/,

From: https://code.jquery.com/jquery-3.5.1.js

I think this is a non-POSIX hack to match any character, including newlines,
which "." does not match unless the "s" flag is set.

Javascript test:

"foo\nbar".match(/(.+)/)[1];
"foo"

"foo\nbar".match(/(.+)/s)[1];
"foo
bar"

"foo\nbar".match(/([\w\W]+)/)[1];
"foo
bar"

/Joel

Re: Extensibility of the PostgreSQL wire protocol

2021-02-19 Thread Jan Wieck
Thank you Kuntal,

On Fri, Feb 19, 2021 at 4:36 AM Kuntal Ghosh 
wrote:

> On Thu, Feb 18, 2021 at 9:32 PM Jan Wieck  wrote:
>
>
> Few comments in the extension code (although experimental):
>
> 1. In telnet_srv.c,
> + static intpe_port;
> ..
> +   DefineCustomIntVariable("telnet_srv.port",
> +   "Telnet server
> port.",
> +   NULL,
> +   &pe_port,
> +   pe_port,
> +   1024,
> +   65536,
> +   PGC_POSTMASTER,
> +   0,
> +   NULL,
> +   NULL,
> +   NULL);
>
> The variable pe_port should be initialized to a value which is > 1024
> and < 65536. Otherwise, the following assert will fail,
> TRAP: FailedAssertion("newval >= conf->min", File: "guc.c", Line:
> 5541, PID: 12100)
>
>
Right, forgot to turn on Asserts.


> 2. The function pq_putbytes shouldn't be used by anyone other than
> old-style COPY out.
> +   pq_putbytes(msg, strlen(msg));
> Otherwise, the following assert will fail in the same function:
> /* Should only be called by old-style COPY OUT */
> Assert(DoingCopyOut);
>

I would argue that the Assert needs to be changed. It is obvious that the
Assert in place is meant to guard against direct usage of pq_putbytes() in
an attempt to force all code to use pq_putmessage() instead. This is good
when speaking the libpq wire protocol, since all messages there are prefixed
with a one-byte message type. It does not apply to other protocols.

I propose to create another global boolean IsNonLibpqFrontend which the
protocol extension will set to true when accepting the connection and the
above then will change to

Assert(DoingCopyOut || IsNonLibpqFrontend);


Regards, Jan



>
> --
> Thanks & Regards,
> Kuntal Ghosh
> Amazon Web Services
>


-- 
Jan Wieck


Re: Extensibility of the PostgreSQL wire protocol

2021-02-19 Thread Heikki Linnakangas

On 19/02/2021 14:29, Damir Simunic wrote:



On 11 Feb 2021, at 16:06, Tom Lane  wrote:

Maybe there is some useful thing that can be accomplished here, but
we need to consider the bigger picture rather than believing
(without proof) that a few hook variables will be enough to do
anything.


Pluggable wire protocol is a game-changer on its own.

The bigger picture is that a right protocol choice enables
large-scale architectural simplifications for whole classes of
production applications.

For browser-based applications (lob, saas, e-commerce), having the
database server speak the browser protocol enables architectures
without backend application code. This in turn leads to significant
reductions of latency, complexity, and application development time.
And it’s not just lack of backend code: one also profits from all the
existing infrastructure like per-query compression/format choice,
browser connection management, sse, multiple streams, prioritization,
caching/cdns, etc.

Don’t know if you’d consider it as a proof, yet I am seeing 2x to 4x
latency reduction in production applications from protocol conversion
to http/2. My present solution is a simple connection pooler I built
on top of Nginx transforming the tcp stream as it passes through.


I can see value in supporting different protocols. I don't like the 
approach discussed in this thread, however.


For example, there has been discussion elsewhere about integrating 
connection pooling into the server itself. For that, you want to have a 
custom process that listens for incoming connections, and launches 
backends independently of the incoming connections. These hooks would 
not help with that.


Similarly, if you want to integrate a web server into the database 
server, you probably also want some kind of connection pooling. A 
one-to-one relationship between HTTP connections and backend processes 
doesn't seem nice.


With the hooks that exist today, would it be possible to write a background 
worker that listens on a port, instead of postmaster? Can you launch 
backends from a background worker? And communicate with the backend processes 
using a shared memory message queue (see pqmq.c)?


I would recommend this approach: write a separate program that sits 
between the client and PostgreSQL, speaking custom protocol to the 
client, and libpq to the backend. And then move that program into a 
background worker process.



In a recent case, letting the browser talk directly to the database
allowed me to get rid of a ~100k-sloc .net backend and all the
complexity and infrastructure that goes with
coding/testing/deploying/maintaining it, while keeping all the
positives: per-query compression/data conversion, querying multiple
databases over a single connection, session cookies, etc. Deployment
is trivial compared to what was before. Latency is down 2x-4x across
the board.


Querying multiple databases over a single connection is not possible 
with the approach taken here. Not sure about the other things you listed.


- Heikki




Re: Extensibility of the PostgreSQL wire protocol

2021-02-19 Thread Jonah H. Harris
On Fri, Feb 19, 2021 at 8:48 AM Heikki Linnakangas  wrote:

> With the hooks that exist today, would it possible to write a background
> worker that listens on a port, instead of postmaster? Can you launch
> backends from a background worker? And communicate the backend processes
> using a shared memory message queue (see pqmq.c).
>

Yes. That's similar to how mine work: A background worker that acts as a
listener for the new protocol which then sets up a new dynamic background
worker on accept(), waits for its creation, passes the fd to the new
background worker, and sits in a while (!got_sigterm) loop checking the
socket for activity and running the protocol similar to postmaster. I
haven't looked at the latest connection pooling patches but, in general,
connection pooling is an abstract issue and should be usable for any type
of connection as, realistically, it's just an event loop and state problem
- it shouldn't be protocol specific.

I would recommend this approach: write a separate program that sits
> between the client and PostgreSQL, speaking custom protocol to the
> client, and libpq to the backend. And then move that program into a
> background worker process.
>

Doing protocol conversion between libpq and a different protocol works, but
is slow. My implementations were originally all proxies that worked outside
the database, then I moved them inside, then I replaced all the libpq code
with SPI-related calls.


> > In a recent case, letting the browser talk directly to the database
> > allowed me to get rid of a ~100k-sloc .net backend and all the
> > complexity and infrastructure that goes with
> > coding/testing/deploying/maintaining it, while keeping all the
> > positives: per-query compression/data conversion, querying multiple
> > databases over a single connection, session cookies, etc. Deployment
> > is trivial compared to what was before. Latency is down 2x-4x across
> > the board.
>
> Querying multiple databases over a single connection is not possible
> with the approach taken here. Not sure about the others things you listed.
>

Accessing multiple databases from the same backend is problematic overall -
I didn't solve that in my implementations either. IIRC, once a bgworker is
attached to a specific database, it's basically stuck with that database.

-- 
Jonah H. Harris


Re: repeated decoding of prepared transactions

2021-02-19 Thread Markus Wanner

Ajin, Amit,

thank you both a lot for thinking this through and even providing a patch.

The changes in the expected output for twophase.out match exactly what I had 
prepared.  And the switch with pg_logical_slot_get_changes is indeed 
something I had not yet considered, either.


On 19.02.21 03:50, Ajin Cherian wrote:

For this, I am planning to change the semantics such that
two-phase-commit can only be specified while creating the slot using
pg_create_logical_replication_slot()
and not in pg_logical_slot_get_changes, thus preventing
two-phase-commit flag from being toggled between restarts of the
decoder. Let me know if anybody objects to this
change, else I will update that in the next patch.


This sounds like a good plan to me, yes.


However, more generally speaking, I suspect you are overthinking this. 
All of the complexity arises because of the assumption that an output 
plugin receiving and confirming a PREPARE may not be able to persist 
that first phase of transaction application.  Instead, you are trying to 
somehow resurrect the transactional changes and the prepare at COMMIT 
PREPARED time and decode it in a deferred way.


Instead, I'm arguing that a PREPARE is an atomic operation just like a 
transaction's COMMIT.  The decoder should always feed these in the order 
of appearance in the WAL.  For example, if you have PREPARE A, COMMIT B, 
COMMIT PREPARED A in the WAL, the decoder should always output these 
events in exactly that order.  And not ever COMMIT B, PREPARE A, COMMIT 
PREPARED A (which is currently violated in the expectation for 
twophase_snapshot, because the COMMIT for `s1insert` there appears after 
the PREPARE of `s2p` in the WAL, but gets decoded before it).
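
To make that ordering concrete, here is the kind of WAL sequence I mean,
as a sketch with made-up names (it assumes max_prepared_transactions > 0):

create table tbl (a int);

begin;
insert into tbl values (1);
prepare transaction 'a';        -- PREPARE A

begin;
insert into tbl values (2);
commit;                         -- COMMIT B

commit prepared 'a';            -- COMMIT PREPARED A

A decoder following the WAL order would emit the PREPARE of 'a', then the
commit of the second transaction, and only then the COMMIT PREPARED of 'a'.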


The patch I'm attaching corrects this expectation in twophase_snapshot, 
adds an explanatory diagram, and eliminates any danger of sending 
PREPAREs at COMMIT PREPARED time.  Thereby preserving the ordering of 
PREPAREs vs COMMITs.


Given the output plugin supports two-phase commit, I argue there must be 
a good reason for it setting the start_decoding_at LSN to a point in 
time after a PREPARE.  To me that means the output plugin (or its 
downstream replica) has processed the PREPARE (and the downstream 
replica did whatever it needed to do on its side in order to make the 
transaction ready to be committed in a second phase).


(In the weird case of an output plugin that wants to enable two-phase 
commit but does not really support it downstream, it's still possible 
for it to hold back LSN confirmations for prepared-but-still-in-flight 
transactions.  However, I'm having a hard time justifying this use case.)


With that line of thinking, the point in time (or in WAL) of the COMMIT 
PREPARED does not matter at all to reason about the decoding of the 
PREPARE operation.  Instead, there are only exactly two cases to consider:


a) the PREPARE happened before the start_decoding_at LSN and must not be 
decoded. (But the effects of the PREPARE must then be included in the 
initial synchronization. If that's not supported, the output plugin 
should not enable two-phase commit.)


b) the PREPARE happens after the start_decoding_at LSN and must be 
decoded.  (It obviously is not included in the initial synchronization 
or decoded by a previous instance of the decoder process.)


The case where the PREPARE lies before SNAPBUILD_CONSISTENT must always 
be case a) where we must not repeat the PREPARE, anyway.  And in case b) 
where we need a consistent snapshot to decode the PREPARE, existing 
provisions already guarantee that to be possible (or how would this be 
different from a regular single-phase commit?).


Please let me know what you think and whether this approach is feasible 
for you as well.


Regards

Markus
>From ed03c463175733072edf8afb8d120a1285a3194f Mon Sep 17 00:00:00 2001
From: Markus Wanner 
Date: Tue, 9 Feb 2021 16:16:13 +0100
Subject: [PATCH] Preserve ordering of PREPAREs vs COMMITs in logical decoding

Decouple decoding of the prepare phase of a two-phase transaction from
the final commit (or rollback) of a two-phase transaction, so that
these are more like atomic operations which preserve the ordering in
WAL.  And so that transactional changes of a PREPARE are not ever
provided to the output plugin unnecessarily.

Correct test expectations to expect no duplication.  Add a variant with
a ROLLBACK PREPARED to twophase_snapshot and illustrate the test case
with an explanatory diagram.
---
 contrib/test_decoding/expected/twophase.out   | 38 -
 .../expected/twophase_snapshot.out| 40 -
 .../expected/twophase_stream.out  | 28 +-
 .../specs/twophase_snapshot.spec  | 56 ---
 doc/src/sgml/logicaldecoding.sgml | 17 --
 .../replication/logical/reorderbuffer.c   | 29 --
 6 files changed, 112 insertions(+), 96 deletions(-)

diff --git a/src/backend/replication/logical/reorderbuffer.c b/src/backend/replicatio

Re: A reloption for partitioned tables - parallel_workers

2021-02-19 Thread Seamus Abshere
hi Amit,

Thanks so much for doing this. I had created

https://commitfest.postgresql.org/32/2987/

and it looks like it now shows your patch as the one to use. Let me know if I 
should do anything else.

Best,
Seamus

On Fri, Feb 19, 2021, at 2:30 AM, Amit Langote wrote:
> On Tue, Feb 16, 2021 at 3:05 PM Amit Langote  wrote:
> > On Tue, Feb 16, 2021 at 1:35 AM Seamus Abshere  wrote:
> > > Here we go, my first patch... solves 
> > > https://www.postgresql.org/message-id/7d6fdc20-857c-4cbe-ae2e-c0ff9520e...@www.fastmail.com
> >
> > Thanks for sending the patch here.
> >
> > It seems you haven't made enough changes for reloptions code to
> > recognize parallel_workers as valid for partitioned tables, because
> > even with the patch applied, I get this:
> >
> > create table rp (a int) partition by range (a);
> > create table rp1 partition of rp for values from (minvalue) to (0);
> > create table rp2 partition of rp for values from (0) to (maxvalue);
> > alter table rp set (parallel_workers = 1);
> > ERROR:  unrecognized parameter "parallel_workers"
> >
> > You need this:
> >
> > diff --git a/src/backend/access/common/reloptions.c
> > b/src/backend/access/common/reloptions.c
> > index 029a73325e..9eb8a0c10d 100644
> > --- a/src/backend/access/common/reloptions.c
> > +++ b/src/backend/access/common/reloptions.c
> > @@ -377,7 +377,7 @@ static relopt_int intRelOpts[] =
> > {
> > "parallel_workers",
> > "Number of parallel processes that can be used per
> > executor node for this relation.",
> > -   RELOPT_KIND_HEAP,
> > +   RELOPT_KIND_HEAP | RELOPT_KIND_PARTITIONED,
> > ShareUpdateExclusiveLock
> > },
> > -1, 0, 1024
> >
> > which tells reloptions parsing code that parallel_workers is
> > acceptable for both heap and partitioned relations.
> >
> > Other comments on the patch:
> >
> > * This function calling style, where the first argument is not placed
> > on the same line as the function itself, is not very common in
> > Postgres:
> >
> > +   /* First see if there is a root-level setting for parallel_workers 
> > */
> > +   parallel_workers = compute_parallel_worker(
> > +   rel,
> > +   -1,
> > +   -1,
> > +   max_parallel_workers_per_gather
> > +
> >
> > This makes the new code look very different from the rest of the
> > codebase.  Better to stick to existing styles.
> >
> > 2. It might be a good idea to use this opportunity to add a function,
> > say compute_append_parallel_workers(), for the code that does what the
> > function name says.  Then the patch will simply add the new
> > compute_parallel_worker() call at the top of that function.
> >
> > 3. I think we should consider the Append parent relation's
> > parallel_workers ONLY if it is a partitioned relation, because it
> > doesn't make a lot of sense for other types of parent relations.  So
> > the new code should look like this:
> >
> > if (IS_PARTITIONED_REL(rel))
> > parallel_workers = compute_parallel_worker(rel, -1, -1,
> > max_parallel_workers_per_gather);
> >
> > 4. Maybe it also doesn't make sense to consider the parent relation's
> > parallel_workers if Parallel Append is disabled
> > (enable_parallel_append = off).  That's because with a simple
> > (non-parallel) Append running under Gather, all launched parallel
> > workers process the same partition before moving to the next one.
> > OTOH, one's intention of setting parallel_workers on the parent
> > partitioned table would most likely be to distribute workers across
> > partitions, which is only possible with parallel Append
> > (enable_parallel_append = on).  So, the new code should look like
> > this:
> >
> > if (IS_PARTITIONED_REL(rel) && enable_parallel_append)
> > parallel_workers = compute_parallel_worker(rel, -1, -1,
> > max_parallel_workers_per_gather);
> 
> Here is an updated version of Seamus' patch that takes into account
> these and other comments received on this thread so far.  It probably
> warrants adding some tests too, but I haven't added any yet.
> 
> Seamus, please register this patch in the next commit-fest:
> https://commitfest.postgresql.org/32/
> 
> If you haven't already, you will need to create a community account to
> use that site.
> 
> -- 
> Amit Langote
> EDB: http://www.enterprisedb.com
> 
> Attachments:
> * v2-0001-Allow-setting-parallel_workers-on-partitioned-tab.patch




Re: Finding cause of test fails on the cfbot site

2021-02-19 Thread Andrew Dunstan

On 2/18/21 11:01 AM, Andrew Dunstan wrote:
> On 2/17/21 3:42 PM, Thomas Munro wrote:
>> On Thu, Feb 18, 2021 at 9:18 AM Andrew Dunstan  wrote:
>>> On 2/17/21 11:06 AM, Tom Lane wrote:
 Peter Smith  writes:
> I saw that one of our commitfest entries (32/2914) is recently
> reporting a fail on the cfbot site [1]. I thought this was all ok a
> few days ago.
> ...
> Is there any other detailed information available anywhere, e.g.
> logs?, which might help us work out what was the cause of the test
> failure?
 AFAIK the cfbot doesn't capture anything beyond the session typescript.
 However, this doesn't look that hard to reproduce locally ... have you
 tried, using similar configure options to what that cfbot run did?
 Once you did reproduce it, there'd be logs under
 contrib/test_decoding/tmp_check/.
>>> yeah. The cfbot runs check-world which makes it difficult for it to know
>>> which log files to show when there's an error. That's a major part of
>>> the reason the buildfarm runs a much finer grained set of steps.
>> Yeah, it's hard to make it print out just the right logs without
>> dumping so much stuff that it's hard to see the wood for the trees;
>> perhaps if the Makefile had an option to dump relevant stuff for the
>> specific tests that failed, or perhaps the buildfarm is already better
>> at that and cfbot should just use the buildfarm client directly.  Hmm.
>> Another idea would be to figure out how to make a tarball of all log
>> files that you can download for inspection with better tools at home
>> when things go wrong.  It would rapidly blow through the 1GB limit for
>> stored "artefacts" on open source/community Cirrus accounts though, so
>> we'd need to figure out how to manage retention carefully.
>
> I did some thinking about this. How about if we have the make files and
> the msvc build system create a well known file with the location(s) to
> search for log files if there's an error. Each bit of testing could
> overwrite this file before starting testing, and then tools like cfbot
> would know where to look for files to report? To keep things clean for
> other users the file would only be created if, say,
> PG_NEED_ERROR_LOG_LOCATIONS is set. The well known location would be
> something like "$(top_builddir)/error_log_locations.txt", and individual
> Makefiles would have entries something like:
>
>
> override ERROR_LOG_LOCATIONS =
> $(top_builddir)/contrib/test_decoding/tmp_check/log
>
>
> If this seems like a good idea I can go and try to make that happen.
>
>

Here's a very small and simple (and possibly naive) POC patch that
demonstrates this and seems to do the right thing.


cheers


andrew

--
Andrew Dunstan
EDB: https://www.enterprisedb.com

diff --git a/src/Makefile.global.in b/src/Makefile.global.in
index 74b3a6acd2..a1339d6ec9 100644
--- a/src/Makefile.global.in
+++ b/src/Makefile.global.in
@@ -390,6 +390,15 @@ endif
 endif
 
 
+.PHONY: set_error_log_locations
+
+set_error_log_locations:
+ifdef PG_NEED_ERROR_LOCATIONS
+ifdef ERROR_LOG_LOCATIONS
+	echo "$(ERROR_LOG_LOCATIONS)" > $(top_builddir)/error_log_locations.txt
+endif
+endif
+
 # Testing
 
 # In much the same way as above, these rules ensure that we build a temp
diff --git a/src/bin/psql/Makefile b/src/bin/psql/Makefile
index d00881163c..c325d76b52 100644
--- a/src/bin/psql/Makefile
+++ b/src/bin/psql/Makefile
@@ -12,6 +12,8 @@
 PGFILEDESC = "psql - the PostgreSQL interactive terminal"
 PGAPPICON=win32
 
+override ERROR_LOG_LOCATIONS = $(abs_top_builddir)/$(subdir)/tmp_check/log
+
 subdir = src/bin/psql
 top_builddir = ../../..
 include $(top_builddir)/src/Makefile.global
@@ -83,7 +85,7 @@ clean distclean:
 maintainer-clean: distclean
 	rm -f sql_help.h sql_help.c psqlscanslash.c
 
-check:
+check: set_error_log_locations
 	$(prove_check)
 
 installcheck:
diff --git a/src/test/regress/GNUmakefile b/src/test/regress/GNUmakefile
index 3d85857bfa..5fbcb5e34d 100644
--- a/src/test/regress/GNUmakefile
+++ b/src/test/regress/GNUmakefile
@@ -13,6 +13,8 @@
 PGFILEDESC = "pg_regress - test driver"
 PGAPPICON = win32
 
+override ERROR_LOG_LOCATIONS = $(abs_top_builddir)/$(subdir)/log
+
 subdir = src/test/regress
 top_builddir = ../../..
 include $(top_builddir)/src/Makefile.global
@@ -128,10 +130,10 @@ tablespace-setup:
 
 REGRESS_OPTS = --dlpath=. --max-concurrent-tests=20 $(EXTRA_REGRESS_OPTS)
 
-check: all tablespace-setup
+check: all set_error_log_locations tablespace-setup
 	$(pg_regress_check) $(REGRESS_OPTS) --schedule=$(srcdir)/parallel_schedule $(MAXCONNOPT) $(EXTRA_TESTS)
 
-check-tests: all tablespace-setup | temp-install
+check-tests: all set_error_log_locations tablespace-setup | temp-install
 	$(pg_regress_check) $(REGRESS_OPTS) $(MAXCONNOPT) $(TESTS) $(EXTRA_TESTS)
 
 installcheck: all tablespace-setup


Re: pg_config_h.in not up-to-date

2021-02-19 Thread Tom Lane
Michael Paquier  writes:
> On Fri, Feb 19, 2021 at 02:21:21AM -0500, Tom Lane wrote:
>> +1, but I think the first period in this comment is redundant:
>> +  AC_DEFINE([USE_OPENSSL], 1, [Define to 1 to build with OpenSSL support. 
>> (--with-ssl=openssl).])

> I guess that you mean the second period here to be more consistent
> with the others?  That would mean the following diff:
> +  AC_DEFINE([USE_OPENSSL], 1, [Define to 1 to build with OpenSSL support. 
> (--with-ssl=openssl)])

Hm.  It should be consistent with the rest, for sure.  Personally I'd put
the only period at the end, but I suppose we should stick with the
prevailing style if there is one.

regards, tom lane




Re: Extensibility of the PostgreSQL wire protocol

2021-02-19 Thread Jan Wieck

On 2/19/21 8:48 AM, Heikki Linnakangas wrote:

I can see value in supporting different protocols. I don't like the
approach discussed in this thread, however.

For example, there has been discussion elsewhere about integrating
connection pooling into the server itself. For that, you want to have a
custom process that listens for incoming connections, and launches
backends independently of the incoming connections. These hooks would
not help with that.


The two are not mutually exclusive. You are right that the current 
proposal would not help with that type of built-in connection pool, but 
it could be extended to cover that.


Give the function that the postmaster calls to accept a connection when a 
server_fd is ready a return code that it can use to tell the postmaster 
"forget about it, don't fork or do anything else with it".  This function 
normally calls StreamConnection() before the postmaster forks the backend. 
But it could instead hand the socket over to the pool background worker 
(I presume Jonah is transferring them from process to process via UDP 
packet).  The pool worker then launches the actual backends, which receive 
a requesting client via the same socket transfer, perform one or more 
transactions, and then hand the socket back to the pool worker.
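
To illustrate, here is an untested sketch of what such a return contract 
could look like.  All names below are made up for this email; this is not 
an existing PostgreSQL API:

/* Hypothetical result of the accept-phase protocol hook (illustration only). */
typedef enum ProtocolAcceptResult
{
	PROTO_ACCEPT_FORK_BACKEND,	/* postmaster proceeds as today: fork a backend */
	PROTO_ACCEPT_HANDED_OFF		/* socket was passed to a pool worker; the
								 * postmaster must not fork or touch it again */
} ProtocolAcceptResult;

/* Hook called by postmaster when a listen socket (server_fd) becomes ready. */
typedef ProtocolAcceptResult (*protocol_accept_hook_type) (int server_fd);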


All of that would still require a protocol extension that has special 
messages for "here is a client socket for you" and "you can have that 
back".




I would recommend this approach: write a separate program that sits
between the client and PostgreSQL, speaking custom protocol to the
client, and libpq to the backend. And then move that program into a
background worker process.


That is a classic protocol-converting proxy. It has been done in the past 
with not very good results, both performance-wise and with respect to 
protocol completeness.



Regards, Jan

--
Jan Wieck
Principle Database Engineer
Amazon Web Services




Re: Problem with accessing TOAST data in stored procedures

2021-02-19 Thread Konstantin Knizhnik



On 19.02.2021 11:12, Pavel Stehule wrote:



pá 19. 2. 2021 v 9:08 odesílatel Konstantin Knizhnik <k.knizh...@postgrespro.ru> napsal:




On 19.02.2021 10:47, Pavel Stehule wrote:



pá 19. 2. 2021 v 8:39 odesílatel Konstantin Knizhnik <k.knizh...@postgrespro.ru> napsal:



On 19.02.2021 10:14, Pavel Stehule wrote:



pá 19. 2. 2021 v 7:51 odesílatel Konstantin Knizhnik <k.knizh...@postgrespro.ru> napsal:



On 18.02.2021 20:10, Pavel Stehule wrote:

This has a negative impact on performance - and a lot
of users use procedures without transaction control. So
it doesn't look like a good solution.

I am more concentrated on the Pg 14 release, where the
work with SPI is redesigned, and I hope so this issue
is fixed there. For older releases, I don't know. Is
this issue related to Postgres or it is related to
PgPro only? If it is related to community pg, then we
should fix and we should accept not too good
performance, because there is no better non invasive
solution. If it is PgPro issue (because there are ATX
support) you can fix it (or you can try backport the
patch

https://git.postgresql.org/gitweb/?p=postgresql.git;a=commitdiff;h=ee895a655ce4341546facd6f23e3e8f2931b96bf
). You have more possibilities on PgPro code base.


Sorry, it is not PgPro specific problem and recent
master suffers from this bug as well.
In the original bug report there was simple scenario of
reproducing the problem:

CREATE TABLE toasted(id serial primary key, data text);
INSERT INTO toasted(data) VALUES((SELECT
string_agg(random()::text,':') FROM generate_series(1,
1000)));
INSERT INTO toasted(data) VALUES((SELECT
string_agg(random()::text,':') FROM generate_series(1,
1000)));
DO $$ DECLARE v_r record; BEGIN FOR v_r in SELECT data
FROM toasted LOOP INSERT INTO toasted(data)
VALUES(v_r.data);COMMIT;END LOOP;END;$$;


can you use new procedure_resowner?


    Sorry, I do not understand your suggestion.
    How can procedure_resowner help to solve this problem?


This is just an idea - I think the most correct with zero
performance impact is keeping snapshot, and this can be stored in
procedure_resowner.

The fundamental question is if we want or allow more snapshots
per query. The implementation is a secondary issue.


I wonder if it is correct from logical point of view.
If we commit transaction in stored procedure, then we actually
implicitly start new transaction.
And new transaction should have new snapshot. Otherwise its
behavior will change.


I have no problem with this. I have a problem with cycle 
implementation - when I iterate over some result, then this result 
should be consistent over all cycles.  In other cases, the behaviour 
is not deterministic.


I have investigated the problem with TOAST data in stored procedures 
further and have come to a very strange conclusion:
to fix the problem, it is enough to pass expand_external=false to 
expanded_record_set_tuple instead of !estate->atomic:


                {
                    /* Only need to assign a new tuple value */
                    expanded_record_set_tuple(rec->erh, tuptab->vals[i],
-                                             true, !estate->atomic);
+                                             true, false);
                }

Why is this correct?
Because in assign_simple_var we have already forced detoasting of the data:

    /*
     * In non-atomic contexts, we do not want to store TOAST pointers in
     * variables, because such pointers might become stale after a commit.
     * Forcibly detoast in such cases.  We don't want to detoast (flatten)
     * expanded objects, however; those should be OK across a transaction
     * boundary since they're just memory-resident objects.  (Elsewhere in
     * this module, operations on expanded records likewise need to request
     * detoasting of record fields when !estate->atomic.  Expanded arrays are
     * not a problem since all array entries are always detoasted.)
     */
    if (!estate->atomic && !isnull && var->datatype->typlen == -1 &&
        VARATT_IS_EXTERNAL_NON_EXPANDED(DatumGetPointer(newvalue)))
    {
        MemoryContext oldcxt;
        Datum         detoasted;

        /*
         * Do the detoasting in the eval_mcontext to avoid long-term leakage
         * of whatever memory toast fetching might leak.  Then we have to copy
         * the detoasted datum to the function's main context, which is a
         * pain, but there's little choice.
         */
        oldcxt = MemoryContextSwitchTo(get_eval_mcontext(estate));
        detoasted = PointerGetDatum(detoast_external_attr((struct varlena *) DatumGetPointer(newvalue)));

Re: Some regular-expression performance hacking

2021-02-19 Thread Tom Lane
"Joel Jacobson"  writes:
> On Thu, Feb 18, 2021, at 19:53, Tom Lane wrote:
>> (Having said that, I can't help noticing that a very large fraction
>> of those usages look like, eg, "[\w\W]".  It seems to me that that's
>> a very expensive and unwieldy way to spell ".".  Am I missing
>> something about what that does in Javascript?)

> I think this is a non-POSIX hack to match any character, including newlines,
> which are not included unless the "s" flag is set.

> "foo\nbar".match(/([\w\W]+)/)[1];
> "foo
> bar"

Oooh, that's very interesting.   I guess the advantage of that over using
the 's' flag is that you can have different behaviors at different places
in the same regex.

I was just wondering about this last night in fact, while hacking on
the code to get it to accept \W etc in bracket expressions.  I see that
right now, our code thinks that NLSTOP mode ('n' switch, the opposite
of 's') should cause \W \D \S to not match newline.  That seems a little
weird, not least because \S should probably be different from the other
two, and it isn't.  And now we see it'd mean that you couldn't use the 'n'
switch to duplicate Javascript's default behavior in this area.  Should we
change it?  (I wonder what Perl does.)

regards, tom lane




Re: Problem with accessing TOAST data in stored procedures

2021-02-19 Thread Pavel Stehule
pá 19. 2. 2021 v 16:19 odesílatel Konstantin Knizhnik <
k.knizh...@postgrespro.ru> napsal:

>
>
> On 19.02.2021 11:12, Pavel Stehule wrote:
>
>
>
> pá 19. 2. 2021 v 9:08 odesílatel Konstantin Knizhnik <
> k.knizh...@postgrespro.ru> napsal:
>
>>
>>
>> On 19.02.2021 10:47, Pavel Stehule wrote:
>>
>>
>>
>> pá 19. 2. 2021 v 8:39 odesílatel Konstantin Knizhnik <
>> k.knizh...@postgrespro.ru> napsal:
>>
>>>
>>>
>>> On 19.02.2021 10:14, Pavel Stehule wrote:
>>>
>>>
>>>
>>> pá 19. 2. 2021 v 7:51 odesílatel Konstantin Knizhnik <
>>> k.knizh...@postgrespro.ru> napsal:
>>>


 On 18.02.2021 20:10, Pavel Stehule wrote:

 This has a negative impact on performance - and a lot of users use
 procedures without transaction control. So it doesn't look like a good
 solution.

 I am more concentrated on the Pg 14 release, where the work with SPI is
 redesigned, and I hope so this issue is fixed there. For older releases, I
 don't know. Is this issue related to Postgres or it is related to PgPro
 only? If it is related to community pg, then we should fix and we should
 accept not too good performance, because there is no better non invasive
 solution. If it is PgPro issue (because there are ATX support) you can fix
 it (or you can try backport the patch
 https://git.postgresql.org/gitweb/?p=postgresql.git;a=commitdiff;h=ee895a655ce4341546facd6f23e3e8f2931b96bf
 ). You have more possibilities on PgPro code base.


 Sorry, it is not PgPro specific problem and recent master suffers from
 this bug as well.
 In the original bug report there was simple scenario of reproducing the
 problem:

 CREATE TABLE toasted(id serial primary key, data text);
 INSERT INTO toasted(data) VALUES((SELECT string_agg(random()::text,':')
 FROM generate_series(1, 1000)));
 INSERT INTO toasted(data) VALUES((SELECT string_agg(random()::text,':')
 FROM generate_series(1, 1000)));
 DO $$ DECLARE v_r record; BEGIN FOR v_r in SELECT data FROM toasted
 LOOP INSERT INTO toasted(data) VALUES(v_r.data);COMMIT;END LOOP;END;$$;

>>>
>>> can you use new procedure_resowner?
>>>
>>> Sorry, I do not understand your suggestion.
>>> How can procedure_resowner help to solve this problem?
>>>
>>
>> This is just an idea - I think the most correct with zero performance
>> impact is keeping snapshot, and this can be stored in procedure_resowner.
>>
>> The fundamental question is if we want or allow more snapshots per query.
>> The implementation is a secondary issue.
>>
>>
>> I wonder if it is correct from logical point of view.
>> If we commit transaction in stored procedure, then we actually implicitly
>> start new transaction.
>> And new transaction should have new snapshot. Otherwise its behavior will
>> change.
>>
>
> I have no problem with this. I have a problem with cycle implementation -
> when I iterate over some result, then this result should be consistent over
> all cycles.  In other cases, the behaviour is not deterministic.
>
>
> I have investigated the problem with TOAST data in stored procedures further
> and have come to a very strange conclusion:
> to fix the problem, it is enough to pass expand_external=false to
> expanded_record_set_tuple instead of !estate->atomic:
>
> {
>     /* Only need to assign a new tuple value */
>     expanded_record_set_tuple(rec->erh, tuptab->vals[i],
> -                             true, !estate->atomic);
> +                             true, false);
> }
>
> Why is this correct?
> Because in assign_simple_var we have already forced detoasting of the data:
>
> /*
>  * In non-atomic contexts, we do not want to store TOAST pointers in
>  * variables, because such pointers might become stale after a commit.
>  * Forcibly detoast in such cases.  We don't want to detoast (flatten)
>  * expanded objects, however; those should be OK across a transaction
>  * boundary since they're just memory-resident objects.  (Elsewhere in
>  * this module, operations on expanded records likewise need to request
>  * detoasting of record fields when !estate->atomic.  Expanded arrays
> are
>  * not a problem since all array entries are always detoasted.)
>  */
> if (!estate->atomic && !isnull && var->datatype->typlen == -1 &&
> VARATT_IS_EXTERNAL_NON_EXPANDED(DatumGetPointer(newvalue)))
> {
> MemoryContext oldcxt;
> Datum         detoasted;
>
> /*
>  * Do the detoasting in the eval_mcontext to avoid long-term
> leakage
>  * of whatever memory toast fetching might leak.  Then we have to
> copy
>  * the detoasted datum to the function's main context, which is a
>  * pain, but there's little choice.
>  */
> oldcxt = MemoryContextSwitchTo(get_eval_mcontext(estate));
> detoasted = PointerGetDatum(detoast_external_attr((struct varlena
> *) DatumGetPointer(newvalue)));
>
>

Re: PATCH: Attempt to make dbsize a bit more consistent

2021-02-19 Thread gkokolatos





‐‐‐ Original Message ‐‐‐
On Monday, February 1, 2021 1:18 PM, Masahiko Sawada  
wrote:

> On Thu, Nov 12, 2020 at 2:54 AM Soumyadeep Chakraborty
> soumyadeep2...@gmail.com wrote:
>
> > Hey Georgios,
> > On Tue, Nov 10, 2020 at 6:20 AM gkokola...@pm.me wrote:
> >
> > > ‐‐‐ Original Message ‐‐‐
> > > On Monday, November 9, 2020 7:50 PM, Soumyadeep Chakraborty 
> > > soumyadeep2...@gmail.com wrote:
> > >
> > > > Hey Georgios,
> > > > Thanks for looking for more avenues to invoke tableAM APIS! Please find
> > > > my review below:
> > >
> > > A great review Soumyadeep, it is much appreciated.
> > > Please remember to add yourself as a reviewer in the commitfest
> > > [https://commitfest.postgresql.org/30/2701/]
> >
> > Ah yes. Sorry, I forgot that!
> >
> > > > On Tue, Oct 13, 2020 at 6:28 AM gkokola...@pm.me wrote:
> > > >
> > > > 1.
> > > >
> > > > > /*
> > > > >
> > > > > -   -   heap size, including FSM and VM
> > > > > -   -   table size, including FSM and VM
> > > > > */
> > > > >
> > > >
> > > > We should not mention FSM and VM in dbsize.c at all as these are
> > > > heap AM specific. We can say:
> > > > table size, excluding TOAST relation
> > >
> > > Yeah, I was thinking that the notion that FSM and VM are still taken
> > > into account should be stated. We are iterating over ForkNumber
> > > after all.
> > > How about phrasing it as:
> > >
> > > -   table size, including all implemented forks from the AM (e.g. FSM, VM)
> > > -   excluding TOAST relations
> > >
> > > Thoughts?
> >
> > Yes, I was thinking along the same lines. The concept of a "fork" forced
> > should not be forced down into the tableAM. But that is a discussion for
> > another day. We can probably say:
> >
> > -   table size, including all implemented forks from the AM (e.g. FSM, VM
> > -   for the heap AM) excluding TOAST relations
> >
> > > > 2.
> > > >
> > > > > /*
> > > > >
> > > > > -   Size of toast relation
> > > > > */
> > > > > if (OidIsValid(rel->rd_rel->reltoastrelid))
> > > > >
> > > > > -   size += calculate_toast_table_size(rel->rd_rel->reltoastrelid);
> > > > >
> > > > > -   {
> > > > >
> > > > > -   Relation toastRel;
> > > > >
> > > > > -
> > > > > -   toastRel = relation_open(rel->rd_rel->reltoastrelid, 
> > > > > AccessShareLock);
> > > > >
> > > >
> > > > We can replace the OidIsValid check with relation_needs_toast_table()
> > > > and then have the OidIsValid() check in an Assert. Perhaps, that will
> > > > make things more readable.
> > >
> > > Please, allow me to kindly disagree.
> > > Relation is already open at this stage. Even create_toast_table(), the
> > > internal workhorse for creating toast relations, does check reltoastrelid
> > > to test if the relation is already toasted.
> > > Furthermore, we do call:
> > >
> > > -   toastRel = relation_open(rel->rd_rel->reltoastrelid, AccessShareLock);
> > >
> > > and in order to avoid elog() errors underneath us, we ought to have
> > > verified the validity of reltoastrelid.
> > > In short, I think that the code in the proposal is not too unreadable
> > > nor that it breaks the coding patterns throughout the codebase.
> > > Am I too wrong?
> >
> > No not at all. The code in the proposal is indeed adhering to the
> > codebase. What I was going for here was to increase the usage of
> > relation_needs_toast_table(). What I meant was:
> > if (table_relation_needs_toast_table(rel))
> > {
> > if (!OidIsValid(rel->rd_rel->reltoastrelid))
> > {
> > elog(ERROR, );
> > }
> > //size calculation here..
> > }
> > We want to error out if the toast table can't be found and the relation
> > is expected to have one, which the existing code doesn't handle.
> >
> > > > 3.
> > > >
> > > > > -   if (rel->rd_rel->relkind == RELKIND_RELATION ||
> > > > > -   rel->rd_rel->relkind == RELKIND_TOASTVALUE ||
> > > > > -   rel->rd_rel->relkind == RELKIND_MATVIEW)
> > > > > -   size = calculate_table_size(rel);
> > > > > -   else
> > > > > -   {
> > > > > -   relation_close(rel, AccessShareLock);
> > > > > -   PG_RETURN_NULL();
> > > > > -   }
> > > >
> > > > This leads to behavioral changes:
> > > > I am talking about the case where one calls pg_table_size() on an index.
> > > > W/o your patch, it returns the size of the index. W/ your patch it
> > > > returns NULL. Judging by the documentation, this function should not
> > > > ideally apply to indexes but it does! I have a sinking feeling that lots
> > > > of users use it for this purpose, as there is no function to calculate
> > > > the size of a single specific index (except pg_relation_size()).
> > > > The same argument I have made above applies to sequences. Users may have
> > > > trial-and-errored their way into finding that pg_table_size() can tell
> > > > them the size of a specific index/sequence! I don't know how widespread
> > > > the use is in the user community, so IMO maybe we should be conservative
> > > > and not introduce this change? Alternatively, we could call out that
> > > > pg_table_siz

Re: [HACKERS] Custom compression methods

2021-02-19 Thread Dilip Kumar
On Thu, Feb 11, 2021 at 1:37 AM Robert Haas  wrote:
>
> On Wed, Feb 10, 2021 at 9:52 AM Dilip Kumar  wrote:
> > [ new patches ]
>
> I think that in both varattrib_4b and toast_internals.h it would be
> better to pick a less generic field name. In toast_internals.h it's
> just info; in postgres.h it's va_info. But:
>
> [rhaas pgsql]$ git grep info | wc -l
>24552
>
> There are no references in the current source tree to va_info, so at
> least that one is greppable, but it's still not very descriptive. I
> suggest info -> tcinfo and va_info -> va_tcinfo, where "tc" stands for
> "TOAST compression". Looking through 24552 references to info to find
> the ones that pertain to this feature might take longer than searching
> the somewhat shorter list of references to tcinfo, which prepatch is
> just:
>
> [rhaas pgsql]$ git grep tcinfo | wc -l
>0

Done as suggested

>
> I don't see why we should allow for datum_decompress to be optional,
> as toast_decompress_datum_slice does. Likely every serious compression
> method will support that anyway. If not, the compression AM can deal
> with the problem, rather than having the core code do it. That will
> save some tiny amount of performance, too.

Done

> src/backend/access/compression/Makefile is missing a copyright header.

Fixed

> It's really sad that lz4_cmdecompress_slice allocates
> VARRAWSIZE_4B_C(value) + VARHDRSZ rather than slicelength + VARHDRSZ
> as pglz_cmdecompress_slice() does. Is that a mistake, or is that
> necessary for some reason? If it's a mistake, let's fix it. If it's
> necessary, let's add a comment about why, probably starting with
> "Unfortunately, ".

In older versions of lz4 there was a problem that the decompressed
data size could be bigger than the slicelength; that is resolved now,
so we can allocate slicelength + VARHDRSZ. I have fixed it.

Please refer the latest patch at
https://www.postgresql.org/message-id/CAFiTN-u2pyXDDDwZXJ-fVUwbLhJSe9TbrVR6rfW_rhdyL1A5bg%40mail.gmail.com

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com




Re: PoC Refactor AM analyse API

2021-02-19 Thread Zhihong Yu
Denis:
Thanks for considering my suggestion.

For #1, I didn't take your example into account. Thanks for pointing that
out.
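
For the record, a minimal standalone check of the rounding point in #1
(floor(val + 0.5) is round-to-nearest, while ceil(val) rounds up):

#include <math.h>
#include <stdio.h>

int
main(void)
{
	/* floor(val + 0.5) rounds to nearest, ceil(val) rounds up */
	printf("floor(2.1 + 0.5) = %.0f, ceil(2.1) = %.0f\n",
		   floor(2.1 + 0.5), ceil(2.1));	/* prints 2 and 3 */
	return 0;
}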

Cheers

On Thu, Feb 18, 2021 at 11:59 PM Denis Smirnov  wrote:

> Hello,  Zhihong.
>
> Thanks for your comments.
>
> 1. I am afraid that an equivalence of "floor(val + 0.5)" to "ceil(val)" is
> incorrect: "floor(2.1 + 0.5)" returns 2 while "ceil(2.1)" returns 3. We
> can’t use such a replacement, as you have suggested.
>
> 2. >> For compare_rows(), it seems the computation of oa and ob can be
> delayed to when ba == bb (after the first two if statements).
> I have checked some examples of ASM code generated by different compilers
> with -O2/O3 flags on https://gcc.godbolt.org and didn’t see any big
> benefit in result CPU instructions. You can check yourself an attached
> example below.
>
>
>
> Best regards,
> Denis Smirnov | Developer
> s...@arenadata.io
> Arenadata | Godovikova 9-17, Moscow 129085 Russia
>
>


Re: [HACKERS] Custom compression methods

2021-02-19 Thread Dilip Kumar
On Thu, Feb 11, 2021 at 3:26 AM Robert Haas  wrote:
>
> In CompareCompressionMethodAndDecompress, I think this is still
> playing a bit fast and loose with the rules around slots. I think we
> can do better. Suppose that at the point where we discover that we
> need to decompress at least one attribute, we create the new slot
> right then, and also memcpy tts_values and tts_isnull. Then, for that
> attribute and any future attributes that need decompression, we reset
> tts_values in the *new* slot, leaving the old one untouched. Then,
> after finishing all the attributes, the if (decompressed_any) block,
> you just have a lot less stuff to do. The advantage of this is that
> you haven't tainted the old slot; it's still got whatever contents it
> had before, and is in a clean state, which seems better to me.

Fixed

>
> It's unclear to me whether this function actually needs to
> ExecMaterializeSlot(newslot). It definitely does need to
> ExecStoreVirtualTuple(newslot) and I think it's a very good idea, if
> not absolutely mandatory, for it not to modify anything about the old
> slot. But what's the argument that the new slot needs to be
> materialized at this point? It may be needed, if the old slot would've
> had to be materialized at this point. But it's something to think
> about.

I think that if the original slot was materialized then materializing the
new slot makes more sense, so I have done it that way.

>
> The CREATE TABLE documentation says that COMPRESSION is a kind of
> column constraint, but that's wrong. For example, you can't write
> CREATE TABLE a (b int4 CONSTRAINT thunk COMPRESSION lz4), for example,
> contrary to what the syntax summary implies. When you fix this so that
> the documentation matches the grammar change, you may also need to
> move the longer description further up in create_table.sgml so the
> order matches.

Fixed

> The use of VARHDRSZ_COMPRESS in toast_get_compression_oid() appears to
> be incorrect. VARHDRSZ_COMPRESS is offsetof(varattrib_4b,
> va_compressed.va_data). But what gets externalized in the case of a
> compressed datum is just VARDATA(dval), which excludes the length
> word, unlike VARHDRSZ_COMPRESS, which does not. This has no
> consequences since we're only going to fetch 1 chunk either way, but I
> think we should make it correct.

Fixed

> TOAST_COMPRESS_SET_SIZE_AND_METHOD() could Assert something about cm_method.

While replying to the comments, I realised that I have missed it.  I
will fix it in the next version.

> Small delta patch with a few other suggested changes attached.

Merged

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com




Re: Extensibility of the PostgreSQL wire protocol

2021-02-19 Thread Damir Simunic


> On 19 Feb 2021, at 14:48, Heikki Linnakangas  wrote:
> 
> For example, there has been discussion elsewhere about integrating connection 
> pooling into the server itself. For that, you want to have a custom process 
> that listens for incoming connections, and launches backends independently of 
> the incoming connections. These hooks would not help with that.
> 

It is not clear how connection pooling in the core is linked to discussing 
pluggable wire protocols.

> Similarly, if you want to integrate a web server into the database server, 
> you probably also want some kind of connection pooling. A one-to-one 
> relationship between HTTP connections and backend processes doesn't seem nice.
> 

HTTP/2 is just a protocol, not unlike fe/be that has a one-to-one relationship 
to backend processes as it stands. It shuttles data back and forth in 
query/response exchanges, and happens to be used by web servers and web 
browsers, among other things. My mentioning of it was simply an example I can 
speak of from experience, as opposed to speculating. Could have brought up any 
other wire protocol if I had experience with it, say MQTT.

To make it clear, “a pluggable wire protocol” as discussed here is a set of 
rules that defines how data is transmitted: what the requests and responses 
are, how the data is laid out on the wire, what to do in case of error, 
etc. It has nothing to do with a web server; why would one want to integrate 
that into the database, anyway?

The intended contribution to the big-picture discussion of pluggable wire 
protocols is that there are significant use cases where the protocol choice is 
restricted on the client side, and allowing a pluggable wire protocol on the 
server side brings tangible benefits in performance and architectural 
simplification. That’s all. The rest were supporting facts that hopefully can 
also serve as a counterpoint to “pluggable wire protocol is primarily useful to 
make Postgres pretend to be Mysql."

Protocol conversion HTTP/2<—>FE/BE on the connection pooler already brings a 
lot of the mentioned benefits, and I’m satisfied with it. Beyond that I’m 
simply supporting the idea of  pluggable protocols as experience so far allows 
me to see advantages that might sound theoretical to someone who never tried 
this scenario in production.

Glad to offer a couple of examples where I see potential for performance gains 
for having such a wire protocol pluggable in the core. Let me know if you want 
me to elaborate.

> Querying multiple databases over a single connection is not possible with the 
> approach taken here. 

Indeed, querying multiple databases over a single connection is something you 
need a proxy for and a different client protocol from fe/be. No need to mix 
that with the talk about pluggable wire protocol. 

My mentioning of it was in the sense “a lot of LoB backend code is nothing more 
than a bloated protocol converter that happens to also allow connecting to 
multiple databases from a single client connection => letting the client speak 
to the database [through a proxy in this case] removed the bloated source of 
latency but kept the advantages.”

--
Damir





Re: Extensibility of the PostgreSQL wire protocol

2021-02-19 Thread Jan Wieck

On 2/19/21 12:18 PM, Damir Simunic wrote:



On 19 Feb 2021, at 14:48, Heikki Linnakangas  wrote:

For example, there has been discussion elsewhere about integrating connection 
pooling into the server itself. For that, you want to have a custom process 
that listens for incoming connections, and launches backends independently of 
the incoming connections. These hooks would not help with that.



It is not clear how connection pooling in the core is linked to discussing 
pluggable wire protocols.


It isn't per se. But there are things pluggable wire protocols can help 
with in regard to connection pooling. For example, a connection pool like 
pgbouncer can be configured to switch the client-backend association at 
the transaction level. It therefore scans the traffic for the 
in-transaction state. This, however, only works if an application uses 
identical session state across all connections in a pool. The JDBC 
driver, for example, only really prepares PreparedStatements after a 
number of executions and then assigns them a name based on a counter. So 
it is neither guaranteed that a certain backend has the same statements 
prepared, nor that they are named the same. Therefore, JDBC-based 
applications cannot use PreparedStatements through pgbouncer in 
transaction mode.


An "extended" libpq protocol could allow the pool to give clients a 
unique ID. The protocol handler would then maintain maps with the SQL of 
prepared statements and what the client thinks their prepared statement 
name is. So when a client sends a P packet, the protocol handler would 
look up the mapping and see if it already has that statement prepared. 
Just add the mapping info or actually create a new statement entry in 
the maps. These maps are of course shared across backends. So if 
another client sends bind+execute and the backend doesn't have a plan 
for that query, it would internally create one.


There are security implications here, so things like the search path 
might have to be part of the maps, but those are implementation details.
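
As a rough illustration only (these structures are invented for this 
email; they are not part of PostgreSQL or pgbouncer), an entry of such a 
shared map might look something like:

#include <stdint.h>

/*
 * Illustration only: one entry of the shared prepared-statement map an
 * extended protocol handler could keep per (client, statement) pair.
 */
typedef struct PoolPreparedStatement
{
	uint64_t	client_id;			/* unique ID the pool handed to the client */
	char	   *client_stmt_name;	/* name the client used in its 'P' message */
	char	   *query_string;		/* SQL text of the prepared statement */
	char	   *search_path;		/* captured so the statement is not replayed
									 * against the wrong objects on another backend */
	char	   *backend_stmt_name;	/* name actually prepared on the current
									 * backend, if it has been prepared there */
} PoolPreparedStatement;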


In the end this would allow a project like pgbouncer to create an 
extended version of the libpq protocol that caters to the very special 
needs of that pool.


Most of that would of course be possible on the pool side itself. But 
the internal structure of pgbouncer isn't suitable for that. It is very 
lightweight and for long SQL queries may never have the complete 'P' 
message in memory. It would also not have direct access to security 
related information like the search path, which would require extra 
round trips between the pool and the backend to retrieve it.


So while not suitable to create a built in pool by itself, loadable wire 
protocols can definitely help with connection pooling.


I also am not sure if building a connection pool into a background 
worker or postmaster is a good idea to begin with. One of the important 
features of a pool is to be able to suspend traffic and make the server 
completely idle to for example be able to restart the postmaster without 
forcibly disconnecting all clients. A pool built into a background 
worker cannot do that.



Regards, Jan

--
Jan Wieck
Principle Database Engineer
Amazon Web Services




Re: PATCH: Batch/pipelining support for libpq

2021-02-19 Thread Alvaro Herrera
Hello, thanks for looking at this patch.

On 2021-Feb-16, Zhihong Yu wrote:

> +   if (querymode == QUERY_SIMPLE)
> +   {
> +   commandFailed(st, "startpipeline", "cannot use pipeline mode
> with the simple query protocol");
> +   st->state = CSTATE_ABORTED;
> +   return CSTATE_ABORTED;
> 
> I wonder why the st->state is only assigned for this if block. The state is
> not set for other cases where CSTATE_ABORTED is returned.

Yeah, that's a simple oversight.  We don't need to set st->state,
because the caller sets it to the value we return.
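
In other words, something like this minimal sketch (assuming the code
lives in the pgbench function that handles \startpipeline and returns the
next connection state): commandFailed() reports the error and the caller
assigns st->state from the return value, so the explicit assignment is
simply dropped.

	if (querymode == QUERY_SIMPLE)
	{
		/* report the error; the caller will set st->state from our result */
		commandFailed(st, "startpipeline",
					  "cannot use pipeline mode with the simple query protocol");
		return CSTATE_ABORTED;
	}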

> Should PQ_PIPELINE_OFF be returned for the following case ?
> 
> +PQpipelineStatus(const PGconn *conn)
> +{
> +   if (!conn)
> +   return false;

Yep.
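
So, roughly (assuming the patch keeps the pipeline state in a
conn->pipelineStatus field; adjust to whatever the struct member ends up
being called):

	PGpipelineStatus
	PQpipelineStatus(const PGconn *conn)
	{
		if (!conn)
			return PQ_PIPELINE_OFF;		/* no connection: report pipeline mode off */

		return conn->pipelineStatus;
	}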

Thanks

-- 
Álvaro Herrera39°49'30"S 73°17'W




Re: PATCH: Batch/pipelining support for libpq

2021-02-19 Thread Alvaro Herrera
On 2021-Jan-21, Zhihong Yu wrote:

> It seems '\\gset or \\aset is not ' would correspond to the check more
> closely.
> 
> +   if (my_command->argc != 1)
> +   syntax_error(source, lineno, my_command->first_line,
> my_command->argv[0],
> 
> It is possible that my_command->argc == 0 (where my_command->argv[0]
> shouldn't be accessed) ?

No -- process_backslash_command reads the word and sets it as argv[0].

> +   appendPQExpBufferStr(&conn->errorMessage,
> + libpq_gettext("cannot queue commands
> during COPY\n"));
> +   return false;
> +   break;
> 
> Is the break necessary ? Similar comment for pqBatchProcessQueue().

Not necessary.  I removed them.

> +int
> +PQexitBatchMode(PGconn *conn)
> 
> Since either 0 or 1 is returned, maybe change the return type to bool.

I was kinda tempted to do that, but in reality a lot of libpq's API is
defined like that -- to return 0 (failure)/1 (success) as ints, not
bools.  For example see PQsendQuery().  I'm not inclined to do different
for these new functions.  (One curious case is PQsetvalue, which returns
int, and is documented as "returns zero for failure" and then uses
"return false").

> Also, the function operates on PGconn - should the function be static
> (pqBatchProcessQueue is) ?

I don't understand this suggestion.  How would an application call it,
if we make it static?

Thanks

-- 
Álvaro Herrera   Valdivia, Chile
"Cómo ponemos nuestros dedos en la arcilla del otro. Eso es la amistad; jugar
al alfarero y ver qué formas se pueden sacar del otro" (C. Halloway en
La Feria de las Tinieblas, R. Bradbury)




Re: Extensibility of the PostgreSQL wire protocol

2021-02-19 Thread Damir Simunic


> On 19 Feb 2021, at 19:30, Jan Wieck  wrote:
> 
> An "extended" libpq protocol could allow the pool to give clients a unique 
> ID. The protocol handler would then maintain maps with the SQL of prepared 
> statements and what the client thinks their prepared statement name is. 

Or, the connection pooler could support a different wire protocol that has some 
form of client cookies and could let the client hold on to an opaque token to 
present back with every query and use that to route to the right backend with a 
prepared statement for that client (or match the appropriate prepared statement 
from the cache), even across client disconnections.

> Most of that would of course be possible on the pool side itself. But the 
> internal structure of pgbouncer isn't suitable for that. It is very 
> lightweight and for long SQL queries may never have the complete 'P' message 
> in memory. It would also not have direct access to security related 
> information like the search path, which would require extra round trips 
> between the pool and the backend to retrieve it.

> 
> So while not suitable to create a built in pool by itself, loadable wire 
> protocols can definitely help with connection pooling.

I think loadable wire protocols will have a positive effect on developing more 
sophisticated connection poolers.

> I also am not sure if building a connection pool into a background worker or 
> postmaster is a good idea to begin with. One of the important features of a 
> pool is to be able to suspend traffic and make the server completely idle to 
> for example be able to restart the postmaster without forcibly disconnecting 
> all clients.

Agreed. Going even further, a connection pooler supporting a protocol like QUIC 
(where the notion of a connection is decoupled from the actual socket connection) 
could help a lot with balancing load between servers and data centers, which 
would also not be convenient for Postgres itself to do with the present 
architecture. (And here, too, a pluggable wire protocol would help with keeping 
tabs on individual backends).

--
Damir



Re: PATCH: Batch/pipelining support for libpq

2021-02-19 Thread Zhihong Yu
Hi,
+static int pqBatchProcessQueue(PGconn *conn);

I was suggesting changing:

+int
+PQexitBatchMode(PGconn *conn)

to:

+static int
+PQexitBatchMode(PGconn *conn)

Cheers

On Fri, Feb 19, 2021 at 10:43 AM Alvaro Herrera 
wrote:

> On 2021-Jan-21, Zhihong Yu wrote:
>
> > It seems '\\gset or \\aset is not ' would correspond to the check more
> > closely.
> >
> > +   if (my_command->argc != 1)
> > +   syntax_error(source, lineno, my_command->first_line,
> > my_command->argv[0],
> >
> > It is possible that my_command->argc == 0 (where my_command->argv[0]
> > shouldn't be accessed) ?
>
> No -- process_backslash_command reads the word and sets it as argv[0].
>
> > +   appendPQExpBufferStr(&conn->errorMessage,
> > + libpq_gettext("cannot queue commands
> > during COPY\n"));
> > +   return false;
> > +   break;
> >
> > Is the break necessary ? Similar comment for pqBatchProcessQueue().
>
> Not necessary.  I removed them.
>
> > +int
> > +PQexitBatchMode(PGconn *conn)
> >
> > Since either 0 or 1 is returned, maybe change the return type to bool.
>
> I was kinda tempted to do that, but in reality a lot of libpq's API is
> defined like that -- to return 0 (failure)/1 (success) as ints, not
> bools.  For example see PQsendQuery().  I'm not inclined to do different
> for these new functions.  (One curious case is PQsetvalue, which returns
> int, and is documented as "returns zero for failure" and then uses
> "return false").
>
> > Also, the function operates on PGconn - should the function be static
> > (pqBatchProcessQueue is) ?
>
> I don't understand this suggestion.  How would an application call it,
> if we make it static?
>
> Thanks
>
> --
> Álvaro Herrera   Valdivia, Chile
> "Cómo ponemos nuestros dedos en la arcilla del otro. Eso es la amistad;
> jugar
> al alfarero y ver qué formas se pueden sacar del otro" (C. Halloway en
> La Feria de las Tinieblas, R. Bradbury)
>


Re: Extensibility of the PostgreSQL wire protocol

2021-02-19 Thread Álvaro Hernández



On 19/2/21 14:48, Heikki Linnakangas wrote:
> [...]
>
> I can see value in supporting different protocols. I don't like the
> approach discussed in this thread, however.
>
> For example, there has been discussion elsewhere about integrating
> connection pooling into the server itself. For that, you want to have
> a custom process that listens for incoming connections, and launches
> backends independently of the incoming connections. These hooks would
> not help with that.
>
> Similarly, if you want to integrate a web server into the database
> server, you probably also want some kind of connection pooling. A
> one-to-one relationship between HTTP connections and backend processes
> doesn't seem nice.

    While I'm far from an HTTP/2 expert and I may be wrong, from all I
know HTTP/2 allows creating full-duplex, multiplexed, asynchronous
channels. So multiple connections can be funneled through a single
connection. This decreases the need for connection pooling. It doesn't
eliminate it completely, as the channel may be busy if a single tenant is
sending a lot of data, and you could not have more than one concurrent
action from a single tenant. OTOH, given these HTTP/2 properties, a
non-pooled HTTP/2 endpoint may already provide significant benefits due
to the multiplexing capabilities.

    So we definitely don't need to assume that an HTTP endpoint would
entail a 1:1 mapping between connections and backend processes.


    Álvaro

-- 

Alvaro Hernandez


---
OnGres






Re: Extensibility of the PostgreSQL wire protocol

2021-02-19 Thread Álvaro Hernández



On 19/2/21 19:30, Jan Wieck wrote:
> [...]
>
> I also am not sure if building a connection pool into a background
> worker or postmaster is a good idea to begin with. One of the
> important features of a pool is to be able to suspend traffic and make
> the server completely idle to for example be able to restart the
> postmaster without forcibly disconnecting all clients. A pool built
> into a background worker cannot do that.
>
>

    In my opinion, there are different reasons to use a connection pool,
which lead to different placements of that connection pool in the
architecture of the system. The ability of a pool to suspend (pause)
traffic and apply live re-configurations is a very important one for
implementing high availability practices, transparent scaling, and others.
But such poolers belong in middleware layers (as in separate processes on
separate servers), where these pausing operations make complete sense.

    Connection poolers fronting the database have other specific
missions, namely to control the fan-in of connections to the database.
These connection poolers make sense as close to the database as possible
(ideally: embedded), but they don't need to perform pause operations
there.


    Álvaro


-- 

Alvaro Hernandez


---
OnGres






Re: PATCH: Batch/pipelining support for libpq

2021-02-19 Thread Alvaro Herrera
Hi,

On 2021-Feb-19, Zhihong Yu wrote:

> Hi,
> +static int pqBatchProcessQueue(PGconn *conn);
> 
> I was suggesting changing:
> 
> +int
> +PQexitBatchMode(PGconn *conn)
> 
> to:
> 
> +static int
> +PQexitBatchMode(PGconn *conn)

I understand that, but what I'm saying is that it doesn't work.
pqBatchProcessQueue can be static because it's only used internally in
libpq, but PQexitBatchMode is supposed to be called by the client
application.

-- 
Álvaro Herrera   Valdivia, Chile
"No necesitamos banderas
 No reconocemos fronteras"  (Jorge González)




Re: PATCH: Batch/pipelining support for libpq

2021-02-19 Thread Zhihong Yu
Hi,
Thanks for the response.

On Fri, Feb 19, 2021 at 12:46 PM Alvaro Herrera 
wrote:

> Hi,
>
> On 2021-Feb-19, Zhihong Yu wrote:
>
> > Hi,
> > +static int pqBatchProcessQueue(PGconn *conn);
> >
> > I was suggesting changing:
> >
> > +int
> > +PQexitBatchMode(PGconn *conn)
> >
> > to:
> >
> > +static int
> > +PQexitBatchMode(PGconn *conn)
>
> I understand that, but what I'm saying is that it doesn't work.
> pqBatchProcessQueue can be static because it's only used internally in
> libpq, but PQexitBatchMode is supposed to be called by the client
> application.
>
> --
> Álvaro Herrera   Valdivia, Chile
> "No necesitamos banderas
>  No reconocemos fronteras"  (Jorge González)
>


Re: [HACKERS] Custom compression methods

2021-02-19 Thread Robert Haas
On Fri, Feb 19, 2021 at 11:12 AM Dilip Kumar  wrote:
> I had an off list discussion with Robert and based on his suggestion
> and a poc patch, I have come up with an updated version for handling
> the composite type.  Basically, the problem was that in ExecEvalRow we
> are first forming the tuple and then we are calling
> HeapTupleHeaderGetDatum and then we again need to deform to find any
> compressed data so that can cause huge performance penalty in all
> unrelated paths which don't even contain any compressed data.  So
> Robert's idea was to check for the compressed/external data even
> before forming the tuple.  I have implemented that and I can see we
> are not seeing any performance penalty.

I think that these performance tests aren't really exercising the
expanded-record stuff, just the ExecEvalRow changes. We need to test
that test case, and I tend to suspect there's going to be a measurable
regression.

I spent some time looking at how datums get into the expanded record
system. There seem to be four possible paths:
expanded_record_set_tuple(), make_expanded_record_from_datum(),
expanded_record_set_field_internal(), and
expanded_record_set_fields(). The first two of these inject an entire
tuple, while the latter two work on a field-by-field basis. For that
reason, the latter two are not really problematic. I'm not quite sure
what the right thing to do is here, but if we wanted to check whether
a Datum that we're absorbing is non-pglz-compressed in those places,
it would be easy to do. Also, as far as I can see,
make_expanded_record_from_datum() is completely unused. So the problem
case is where expanded_record_set_tuple() is getting called, and
specifically where it's being called with expand_external = true. Any
place that it's being called with expand_external = false, there's
apparently no problem with the result tuple containing external
datums, so probably non-pglz compressed data is OK there too.
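
For instance, the check in those field-by-field paths could be as simple
as something like the following sketch; toast_get_compression_id() and
TOAST_PGLZ_COMPRESSION_ID are assumptions about what the patch provides,
so adjust to its final names:

/*
 * Sketch only: detect a compressed datum whose compression method is not
 * the built-in pglz.  The accessor names here are assumptions about the
 * patch, not settled API.
 */
static bool
datum_is_non_pglz_compressed(Datum value)
{
	struct varlena *attr = (struct varlena *) DatumGetPointer(value);

	return VARATT_IS_COMPRESSED(attr) &&
		toast_get_compression_id(attr) != TOAST_PGLZ_COMPRESSION_ID;
}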

All of the places that can potentially pass expand_external = true are
actually passing !estate->atomic, where estate is a PLpgSQL_execstate.
In other words, I think the case where this happens is when we're in a
context where the computed value could need to survive across a COMMIT
or ROLLBACK, like there may be a procedure running (maybe more than
one, each invoking the next via CALL) but there are no queries in
progress. We have to expand TOAST references because committing a
transaction that deleted data means you might not be able to resolve
the old TOAST pointer any more: even if you use a snapshot that can
see everything, VACUUM could nuke the deleted rows - or some of them -
at any time. To avoid trouble we have to un-externalize before any
COMMIT or ROLLBACK occurs. That can suck for performance because we
might be fetching a big value that we don't end up using for anything
- say if the variable isn't used again - but it beats failing.

The argument that we need to force decompression in such cases is
considerably more tenuous. It revolves around the possibility that the
compression AM itself has been dropped. As long as we have only
built-in compression methods, which are undroppable, it seems like we
could potentially just decide to do nothing at all about this issue.
If the only reason for expanding TOAST pointers inside the
expanded-record stuff is to avoid the possibility of one being
invalidated by a transaction commit, and if compression methods can't
be invalidated by a transaction commit, well then we don't really have
a problem. That's not a great solution in terms of our chances of
getting this whole patch series committed, but it might at least be
enough to unblock the first few patches, and we could document the
rest of the issue for later research.

What makes me a bit uncomfortable about that approach is that it
presupposes that everything that uses expanded records has some other
defense against those tuples getting written to disk without first
expanding any external datums. And it isn't obvious that this is the
case, or at least not to me. For example, PLpgsql's
coerce_function_result_tuple()'s code for
VARATT_IS_EXTERNAL_EXPANDED() has three cases. The first case passes
the tuple through SPI_returntuple() which calls
heap_copy_tuple_as_datum() which calls toast_flatten_tuple_to_datum()
if required, but the second case calls EOH_flatten_into() and does NOT
pass the result through SPI_returntuple(). And ER_flatten_into() has
no defense against this case that I can see: sure, it skips the fast
path if ER_FLAG_HAVE_EXTERNAL is set, but that doesn't actually do
anything to resolve TOAST pointers. Maybe there's no bug there for
some reason, but I don't know what that reason might be. We seem to
have no test cases either in the main test suite or in the plpgsql
test suite where ER_flatten_into() gets called with
ER_FLAG_HAVE_EXTERNAL set, which seems a little unfortunate. If
there is such a bug here it's independent of this patch, I suppose,
but it would still be nice to understand what's going on here better
than I do.

Re: Finding cause of test fails on the cfbot site

2021-02-19 Thread Peter Smith
Here is another related question about the cfbot error reporting -

The main cfbot "status page" [1] still shows a couple of fails for the
32/2914 (for freebsd & linux). But looking more closely, those fails
are not from the latest run. e.g. I also found this execution
"history" page [2] for our patch which shows the most recent run was
ok for commit a7e715.

~~

So it seems like there is some kind of rule that says the main status
page will still indicate "recent" errors (even if the latest execution
was ok)...

IIUC that explains the difference between a hollow red 'X' (old fail)
and a solid red 'X' fail (new fail)? And I am guessing if our patch
continues to work ok (for how long?) then that hollow red 'X' may
morph into a solid green 'tick' (when the old fail becomes too old to
care about anymore)?

But those are just my guesses based on those icon tooltips. What
*really* are the rules for those main page status indicators, and how
long do the old failure icons linger around before changing to success
icons? (Apologies if a legend for those icons is already described
somewhere - I didn't find it).

Thanks!

--
[1] http://cfbot.cputube.org/
[2] https://cirrus-ci.com/github/postgresql-cfbot/postgresql/commitfest/32/2914

Kind Regards,
Peter Smith.
Fujitsu Australia




Re: Finding cause of test fails on the cfbot site

2021-02-19 Thread Thomas Munro
On Sat, Feb 20, 2021 at 10:31 AM Peter Smith  wrote:
> Here is another related question about the cfbot error reporting -
>
> The main cfbot "status page" [1] still shows a couple of fails for the
> 32/2914 (for freebsd & linux). But looking more closely, those fails
> are not from the latest run. e.g. I also found this execution
> "history" page [2] for our patch which shows the most recent run was
> ok for commit a7e715.
>
> ~~
>
> So it seems like there is some kind of rule that says the main status
> page will still indicate "recent" errors (even if the latest execution
> was ok)...

Hmmph.  It seems like there is indeed some kind of occasional glitch,
possibly a bug in my code for picking up statuses from Cirrus through
their GraphQL API (that's a recent thing; we had to change providers
due to another CI's policy changes in January, and apparently
something in the new pipeline isn't quite fully baked).  Will look
into that this weekend.  Sorry about that, and thanks for letting me
know.

> IIUC that explains the difference between a hollow red 'X' (old fail)
> and a solid red 'X' fail (new fail)? And I am guessing if our patch
> continues to work ok (for how long?) then that hollow red 'X' may
> morph into a solid green 'tick' (when the old fail becomes too old to
> care about anymore)?
>
> But those are just my guesses based on those icon tooltips. What
> *really* are the rules for those main page status indicators, and how
> long do the old failure icons linger around before changing to success
> icons? (Apologies if a legend for those icons is already described
> somewhere - I didn't find it).

Yeah, I will try to clarify the UI a bit...




Re: [HACKERS] Custom compression methods

2021-02-19 Thread Robert Haas
On Fri, Feb 19, 2021 at 4:21 PM Robert Haas  wrote:
> What makes me a bit uncomfortable about that approach is that it
> presupposes that everything that uses expanded records has some other
> defense against those tuples getting written to disk without first
> expanding any external datums. And it isn't obvious that this is the
> case, or at least not to me. For example, PLpgsql's
> coerce_function_result_tuple()'s code for
> VARATT_IS_EXTERNAL_EXPANDED() has three cases. The first case passes
> the tuple through SPI_returntuple() which calls
> heap_copy_tuple_as_datum() which calls toast_flatten_tuple_to_datum()
> if required, but the second case calls EOH_flatten_into() and does NOT
> pass the result through SPI_returntuple(). And ER_flatten_into() has
> no defense against this case that I can see: sure, it skips the fast
> path if ER_FLAG_HAVE_EXTERNAL is set, but that doesn't actually do
> anything to resolve TOAST pointers. Maybe there's no bug there for
> some reason, but I don't know what that reason might be. We seem to
> have no test cases either in the main test suite or in the plpgsql
> test suite where ER_flatten_into() gets called with
> ER_FLAG_HAVE_EXTERNAL set, which seems a little unfortunate. If
> there is such a bug here it's independent of this patch, I suppose,
> but it would still be nice to understand what's going on here better
> than I do.

Andres just pointed out to me the error of my thinking here:
ER_flatten_into can *never* encounter a case with both
ER_FLAG_FVALUE_VALID and ER_FLAG_HAVE_EXTERNAL, because
ER_get_flat_size has to get called first, and will de-toast external
values as it goes. So there actually is justification for
coerce_function_result_tuple() to skip the call to SPI_returntuple().

Given that, one might wonder why the test in ER_flatten_into() even
cares about ER_FLAG_HAVE_EXTERNAL in the first place... I suppose it's
just a harmless oversight.

-- 
Robert Haas
EDB: http://www.enterprisedb.com




Re: PATCH: Batch/pipelining support for libpq

2021-02-19 Thread Craig Ringer
On Wed, 17 Feb 2021, 07:13 Alvaro Herrera,  wrote:

> Here's a new version, where I've renamed everything to "pipeline".


Wow. Thanks.

> I
> think the docs could use some additional tweaks now in order to make a
> coherent story on pipeline mode, how it can be used in a batched
> fashion, etc.
>

I'll do that soon and send an update.

>
>


Re: pg_config_h.in not up-to-date

2021-02-19 Thread Michael Paquier
On Fri, Feb 19, 2021 at 09:57:22AM -0500, Tom Lane wrote:
> Hm.  It should be consistent with the rest, for sure.  Personally I'd put
> the only period at the end, but I suppose we should stick with the
> prevailing style if there is one.

Thanks.  I have just used the same style as XML, LDAP and LLVM then.
Thanks Antonin for the report.
--
Michael


signature.asc
Description: PGP signature


Re: PATCH: Batch/pipelining support for libpq

2021-02-19 Thread Alvaro Herrera
On 2021-Feb-20, Craig Ringer wrote:

> On Wed, 17 Feb 2021, 07:13 Alvaro Herrera,  wrote:

> > I
> > think the docs could use some additional tweaks now in order to make a
> > coherent story on pipeline mode, how it can be used in a batched
> > fashion, etc.
> 
> I'll do that soon and send an update.

I started to do that, but was sidetracked having a look at error
handling while checking one of the claims.

So now it starts with this:

   
Issuing Queries


 After entering pipeline mode, the application dispatches requests using
 PQsendQueryParams,
 or its prepared-query sibling
 PQsendQueryPrepared.
 These requests are queued on the client-side until flushed to the server;
 this occurs when PQsendPipeline is used to
 establish a synchronization point in the pipeline,
 or when PQflush is called.
 The functions PQsendPrepare,
 PQsendDescribePrepared, and
 PQsendDescribePortal also work in pipeline mode.
 Result processing is described below.



 The server executes statements, and returns results, in the order the
 client sends them.  The server will begin executing the commands in the
 pipeline immediately, not waiting for the end of the pipeline.
 If any statement encounters an error, the server aborts the current
 transaction and skips processing commands in the pipeline until the
 next synchronization point established by 
PQsendPipeline.
 (This remains true even if the commands in the pipeline would rollback
 the transaction.)
 Query processing resumes after the synchronization point.


(Note I changed the wording that "the pipeline is ended by
PQsendPipeline" to "a synchronization point is established".  Is this
easily understandable?  On re-reading it, I'm not sure it's really an
improvement.)

BTW we don't explain why PQsendQuery doesn't work (to wit: because it
uses "simple" query protocol and thus doesn't require the Sync message).
I think we should either document that, or change things so that
PQsendQuery no longer uses a 'Q' message when in pipeline mode; instead
use extended query protocol.  I don't see why we'd force people to use
PQsendQueryParams when not necessary.

BTW, in my experimentation, the sequence of PGresults that you get when
handling a pipeline with errors doesn't make a lot of sense.  I'll spend
some more time on it.

While at it, as for the PGresult sequence and NULL returns: I think for
PQexec and maybe PQsendQuery, it makes sense to loop until PQgetResult
returns NULL, because you never know how many results you are going to
get.  But for PQsendQueryParams, this is no longer true, because you
can't send multiple queries in one query string.  So the only way to get
multiple results is by using single-row mode.  But that already has its
own protocol for multiple results, namely to get a stream of
PGRES_SINGLE_TUPLE terminated by a zero-rows PGRES_TUPLES_OK.  So I'm
not sure there is a strong need for the mandatory NULL result.
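
To make that concrete, the client loop I have in mind looks roughly like
this.  It is a sketch only, using the function names from the patch as
posted (e.g. PQsendPipeline for the synchronization point); the exact
names and the result sequence are precisely what's being discussed, so
treat it as illustrative:

#include <libpq-fe.h>

/* "conn" is assumed to be an already-established connection */
static void
pipeline_sketch(PGconn *conn)
{
    const char *v1[1] = {"1"};
    const char *v2[1] = {"2"};
    PGresult   *res;

    PQenterPipelineMode(conn);

    PQsendQueryParams(conn, "INSERT INTO t VALUES ($1)",
                      1, NULL, v1, NULL, NULL, 0);
    PQsendQueryParams(conn, "INSERT INTO t VALUES ($1)",
                      1, NULL, v2, NULL, NULL, 0);
    PQsendPipeline(conn);       /* flush and establish a sync point */

    res = PQgetResult(conn);    /* first query: PGRES_COMMAND_OK or error */
    PQclear(res);
    res = PQgetResult(conn);    /* the mandatory NULL questioned above */

    res = PQgetResult(conn);    /* second query, same pattern */
    PQclear(res);
    res = PQgetResult(conn);    /* NULL again */

    res = PQgetResult(conn);    /* result for the synchronization point */
    PQclear(res);

    PQexitPipelineMode(conn);
}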

-- 
Álvaro Herrera39°49'30"S 73°17'W




Re: progress reporting for partitioned REINDEX

2021-02-19 Thread Michael Paquier
On Fri, Feb 19, 2021 at 12:12:54AM -0600, Justin Pryzby wrote:
> Looks fine.

Thanks, applied then to clarify things.

> Also, I noticed that vacuum recurses into partition hierarchies since v10, but
> pg_stat_progress_vacuum also doesn't show anything about the parent table or
> the progress of recursing through the hierarchy.

Yeah, that's an area where it would be possible to improve the
monitoring, for both autovacuums and manual VACUUMs.
--
Michael


signature.asc
Description: PGP signature


Re: A reloption for partitioned tables - parallel_workers

2021-02-19 Thread Amit Langote
On Fri, Feb 19, 2021 at 11:54 PM Seamus Abshere  wrote:
> On Fri, Feb 19, 2021, at 2:30 AM, Amit Langote wrote:
> > Here is an updated version of the Seamus' patch that takes into
> > account these and other comments received on this thread so far.
> > Maybe warrants adding some tests too but I haven't.
> >
> > Seamus, please register this patch in the next commit-fest:
> > https://commitfest.postgresql.org/32/
> >
> > If you haven't already, you will need to create a community account to
> > use that site.
>
> hi Amit,
>
> Thanks so much for doing this. I had created
>
> https://commitfest.postgresql.org/32/2987/

Ah, sorry, I had not checked.  I updated the entry to add you as the author.

> and it looks like it now shows your patch as the one to use. Let me know if I 
> should do anything else.

You could take a look at the latest patch and see if you find any
problems with my or others' suggestions that I implemented in the v2
patch.  Also, please add regression tests for the new feature in
src/test/regress/sql/select_parallel.sql.

-- 
Amit Langote
EDB: http://www.enterprisedb.com




UniqueKey on Partitioned table.

2021-02-19 Thread Andy Fan
Suppose we have a partitioned table defined as below:

P(a, b, c, d)  partition by (a)
  p1  (a=1)
  p2  (a=2)
  p3  (a=3)

Since in PG we can define different indexes on different partitions, each
child may have different UniqueKeys, and some of them might be invalid at
the parent's level. For example: 1) We define a unique index only on
p1(c), so (c) would be a UniqueKey of p1 alone; it is invalid at the
appendrel level.  2) We define a unique index p_n(c) on each childrel, so
every childrel has the UniqueKey (c); however, it is still invalid on the
appendrel.  3) We define a unique index p_n(a, c); since a is the
partition key, (a, c) is valid for both the child rels and the parent
rel.

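In SQL, the three cases look roughly like this (hypothetical indexes on
the p1/p2/p3 partitions sketched above; each case considered on its own):

-- case 1: only p1 has the index, so (c) is a UniqueKey of p1 alone
create unique index on p1 (c);

-- case 2: every child has the index, but (c) is still not unique across
-- p, because different partitions may contain the same value of c
create unique index on p1 (c);
create unique index on p2 (c);
create unique index on p3 (c);

-- case 3: the partition key a is part of the index key, so (a, c) is
-- unique for each child and for p as a whole
create unique index on p1 (a, c);
create unique index on p2 (a, c);
create unique index on p3 (a, c);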

In my earlier v1 implementation[1], I maintained the child rels exactly
the same way as for a non-partitioned table. But when calculating the
UniqueKey for a partitioned table, I first introduced a
global_unique_indexlist which handles the above 3 cases. The indexes from
case 1 and case 2 are not put into global_unique_indexlist, but the
indexes from case 3 are, even though they are only built at the child
level. After building the global_unique_indexlist for the appendrel, we
build the UniqueKey exactly the same way as for a non-partitioned table.
I'm no longer happy with that method because 1) the
global_unique_indexlist is built in a hard way; 2) I have to completely
ignore the UniqueKey at the child level and re-compute it at the
appendrel level; 3) the 3 cases should rarely happen in real life, I
guess.

When re-implementing the UniqueKey with EquivalenceClass, I re-thought
how to handle the above stuff. Now my preferred idea is simply not to
handle it. When building the UniqueKey for the parent rel, we handle just
2 cases: 1) If the appendrel has only 1 child, we just copy (and modify
if needed, due to the col-order-mismatch case) the child's UniqueKey.
2) Otherwise we only handle unique indexes defined at the top level,
which yields the situation below:

create unique index on p(a, b); --> (A, B) will be the UniqueKey of p.
create unique index on p_nn(a, b); --> (A, B) will not be the UniqueKey of
p even if we create it on ALL the child rels. The result is not perfect
but I think it is practical.  Any suggestions?

The attached is a UniqueKey-with-EquivalenceClass patch; I have only
completed the single-relation part and it may have bugs. I attach it here
for design review only. The not-null-attrs patch is just v1, which we can
continue discussing on the original thread[2].

[1]
https://www.postgresql.org/message-id/CAKU4AWr1BmbQB4F7j22G%2BNS4dNuem6dKaUf%2B1BK8me61uBgqqg%40mail.gmail.com
[2]
https://www.postgresql.org/message-id/flat/caku4awpqjaqjwq2x-ar9g3+zhrzu1k8hnp7a+_mluov-n5a...@mail.gmail.com


-- 
Best Regards
Andy Fan (https://www.aliyun.com/)


v1-0001-Introduce-notnullattrs-field-in-RelOptInfo-to-ind.patch
Description: Binary data


v1-0002-UniqueKey-with-EquivalenceClass-for-single-rel-on.patch
Description: Binary data


Re: progress reporting for partitioned REINDEX

2021-02-19 Thread Justin Pryzby
On Sat, Feb 20, 2021 at 10:37:08AM +0900, Michael Paquier wrote:
> > Also, I noticed that vacuum recurses into partition hierarchies since v10, 
> > but
> > pg_stat_progress_vacuum also doesn't show anything about the parent table or
> > the progress of recursing through the hierarchy.
> 
> Yeah, that's an area where it would be possible to improve the
> monitoring, for both autovacuums and manual VACUUMs.

I was thinking that instead of reporting partitions_done/partitions_total in
the individual progress views, maybe the progress across partitions should be
reported in a separate pg_stat_progress_partitioned.  This would apply to my
CLUSTER patch as well as VACUUM.  I haven't thought about the implementation,
though.

If the partitions_done/total were *removed* from the create_index view, that
would resolve the odd behavior that a single row simultaneously shows 1) the
overall progress of the operation across partitions; and, 2) the detailed
information about the status of the operation on the current leaf partition.

However I guess it's not general enough to support progress reports of
execution of planned (not utility) statements.

-- 
Justin




Re: repeated decoding of prepared transactions

2021-02-19 Thread Amit Kapila
On Fri, Feb 19, 2021 at 8:23 PM Markus Wanner
 wrote:
>
> With that line of thinking, the point in time (or in WAL) of the COMMIT
> PREPARED does not matter at all to reason about the decoding of the
> PREPARE operation.  Instead, there are only exactly two cases to consider:
>
> a) the PREPARE happened before the start_decoding_at LSN and must not be
> decoded. (But the effects of the PREPARE must then be included in the
> initial synchronization. If that's not supported, the output plugin
> should not enable two-phase commit.)
>

I see a problem with this assumption. During the initial
synchronization, this transaction won't be visible to the snapshot and
we won't copy it. Then later, if we don't decode and send it, the
replica will be out of sync. Such a problem won't happen with Ajin's
patch.

-- 
With Regards,
Amit Kapila.




Re: A reloption for partitioned tables - parallel_workers

2021-02-19 Thread Laurenz Albe
On Fri, 2021-02-19 at 16:30 +0900, Amit Langote wrote:
> On Tue, Feb 16, 2021 at 1:35 AM Seamus Abshere  wrote:
> > > Here we go, my first patch... solves 
> > > https://www.postgresql.org/message-id/7d6fdc20-857c-4cbe-ae2e-c0ff9520e...@www.fastmail.com
> 
> Here is an updated version of the Seamus' patch that takes into
> account these and other comments received on this thread so far.
> Maybe warrants adding some tests too but I haven't.

Yes, there should be regression tests.

I gave the patch a spin, and it allows raising the number of workers for
a parallel append as advertised.

--- a/doc/src/sgml/ref/create_table.sgml
+++ b/doc/src/sgml/ref/create_table.sgml
@@ -1337,8 +1337,9 @@ WITH ( MODULUS numeric_literal, REM
 If a table parameter value is set and the
 equivalent toast. parameter is not, the TOAST table
 will use the table's parameter value.
-Specifying these parameters for partitioned tables is not supported,
-but you may specify them for individual leaf partitions.
+Specifying most of these parameters for partitioned tables is not
+supported, but you may specify them for individual leaf partitions;
+refer to the description of individual parameters for more details.


This doesn't make me happy.  Since the options themselves do not say if they
are supported on partitioned tables or not, the reader is left in the dark.

Perhaps:

  These options, with the exception of parallel_workers,
  are not supported on partitioned tables, but you may specify them for 
individual
  leaf partitions.

@@ -1401,9 +1402,12 @@ WITH ( MODULUS numeric_literal, REM
  
   This sets the number of workers that should be used to assist a parallel
   scan of this table.  If not set, the system will determine a value based
-  on the relation size.  The actual number of workers chosen by the planner
-  or by utility statements that use parallel scans may be less, for example
-  due to the setting of .
+  on the relation size.  When set on a partitioned table, the specified
+  number of workers will work on distinct partitions, so the number of
+  partitions affected by the parallel operation should be taken into
+  account.  The actual number of workers chosen by the planner or by
+  utility statements that use parallel scans may be less, for example due
+  to the setting of .
  
 


The reader is left to believe that the default number of workers depends on the
size of the partitioned table, which is not entirely true.

Perhaps:

  If not set, the system will determine a value based on the relation size and
  the number of scanned partitions.

--- a/src/backend/optimizer/path/allpaths.c
+++ b/src/backend/optimizer/path/allpaths.c
@@ -1268,6 +1268,59 @@ set_append_rel_pathlist(PlannerInfo *root, RelOptInfo 
*rel,
add_paths_to_append_rel(root, rel, live_childrels);
 }
 
+/*
+ * compute_append_parallel_workers
+ * Computes the number of workers to assign to scan the subpaths 
appended
+ * by a given Append path
+ */
+static int
+compute_append_parallel_workers(RelOptInfo *rel, List *subpaths,
+   int num_live_children,
+   bool parallel_append)

The new function should have a prototype.

+{
+   ListCell   *lc;
+   int parallel_workers = 0;
+
+   /*
+* For partitioned rels, first see if there is a root-level setting for
+* parallel_workers.  But only consider if a Parallel Append plan is
+* to be considered.
+*/
+   if (IS_PARTITIONED_REL(rel) && parallel_append)
+   parallel_workers =
+   compute_parallel_worker(rel, -1, -1,
+   max_parallel_workers_per_gather);
+
+   /* Find the highest number of workers requested for any subpath. */
+   foreach(lc, subpaths)
+   {
+   Path   *path = lfirst(lc);
+
+   parallel_workers = Max(parallel_workers, path->parallel_workers);
+   }
+   Assert(parallel_workers > 0 || subpaths == NIL);
+
+   /*
+* If the use of parallel append is permitted, always request at least
+* log2(# of children) workers.  We assume it can be useful to have
+* extra workers in this case because they will be spread out across
+* the children.  The precise formula is just a guess, but we don't
+* want to end up with a radically different answer for a table with N
+* partitions vs. an unpartitioned table with the same data, so the
+* use of some kind of log-scaling here seems to make some sense.
+*/
+   if (parallel_append)
+   {
+   parallel_workers = Max(parallel_workers,
+  fls(num_live_children));
+   parallel_workers = Min(parallel_workers,
+  max_parallel_workers_per_gather);
+   }
+   Assert(parallel_workers > 0);
+
+   return parallel_workers;
+}

That means that it is not possible to *lower* the number of parallel workers
with this reloption.

Re: repeated decoding of prepared transactions

2021-02-19 Thread Amit Kapila
On Fri, Feb 19, 2021 at 8:21 AM Ajin Cherian  wrote:
>
> Based on this suggestion, I have created a patch on HEAD which now
> does not allow repeated decoding
> of prepared transactions. For this, the code now enforces
> full_snapshot if two-phase decoding is enabled.
> Do have a look at the patch and see if you have any comments.
>

Few minor comments:
===
1.
.git/rebase-apply/patch:135: trailing whitespace.
 * We need to mark the transaction as prepared, so that we
don't resend it on
warning: 1 line adds whitespace errors.

Whitespace issue.

2.
/*
+ * Set snapshot type
+ */
+void
+SetSnapBuildType(SnapBuild *builder, bool need_full_snapshot)

There is no caller which passes the second parameter as false, so why
have it? Can't we have a function with SetSnapBuildFullSnapshot or
something like that?

3.
@@ -431,6 +431,10 @@ CreateInitDecodingContext(const char *plugin,
  startup_cb_wrapper(ctx, &ctx->options, true);
  MemoryContextSwitchTo(old_context);

+ /* If two-phase is on, then only full snapshot can be used */
+ if (ctx->twophase)
+ SetSnapBuildType(ctx->snapshot_builder, true);
+
  ctx->reorder->output_rewrites = ctx->options.receive_rewrites;

  return ctx;
@@ -534,6 +538,10 @@ CreateDecodingContext(XLogRecPtr start_lsn,

  ctx->reorder->output_rewrites = ctx->options.receive_rewrites;

+ /* If two-phase is on, then only full snapshot can be used */
+ if (ctx->twophase)
+ SetSnapBuildType(ctx->snapshot_builder, true);

I think it is better to add a detailed comment on why we are doing
this? You can write the comment in one of the places.

> Currently  one problem with this, as you have also mentioned in your
> last mail,  is that if initially two-phase is disabled in
> test_decoding while
> decoding prepare (causing the prepared transaction to not be decoded)
> and later enabled after the commit prepared (where it assumes that the
> transaction was decoded at prepare time), then the transaction is not
> decoded at all. For eg:
>
> postgres=# begin;
> BEGIN
> postgres=*# INSERT INTO do_write DEFAULT VALUES;
> INSERT 0 1
> postgres=*# PREPARE TRANSACTION 'test1';
> PREPARE TRANSACTION
> postgres=# SELECT data  FROM
> pg_logical_slot_get_changes('isolation_slot', NULL, NULL,
> 'include-xids', 'false', 'skip-empty-xacts', '1', 'two-phase-commit',
> '0');
> data
> --
> (0 rows)
> postgres=# commit prepared 'test1';
> COMMIT PREPARED
> postgres=# SELECT data  FROM
> pg_logical_slot_get_changes('isolation_slot', NULL, NULL,
> 'include-xids', 'false', 'skip-empty-xacts', '1', 'two-phase-commit',
> '1');
>  data
> -
> COMMIT PREPARED 'test1'
> (1 row)
>
> 1st pg_logical_slot_get_changes is called with two-phase-commit off,
> 2nd is called with two-phase-commit on. You can see that the
> transaction is not decoded at all.
> For this, I am planning to change the semantics such that
> two-phase-commit can only be specified while creating the slot using
> pg_create_logical_replication_slot()
> and not in pg_logical_slot_get_changes, thus preventing
> two-phase-commit flag from being toggled between restarts of the
> decoder.
>

+1.

-- 
With Regards,
Amit Kapila.




Re: [HACKERS] Custom compression methods

2021-02-19 Thread Dilip Kumar
On Sat, Feb 20, 2021 at 2:51 AM Robert Haas  wrote:
>
> On Fri, Feb 19, 2021 at 11:12 AM Dilip Kumar  wrote:
> I think that these performance tests aren't really exercising the
> expanded-record stuff, just the ExecEvalRow changes. We need to test
> that test case, and I tend to suspect there's going to be a measurable
> regression.

I will do testing around this area.

> I spent some time looking at how datums get into the expanded record
> system. There seem to be four possible paths:
> expanded_record_set_tuple(), make_expanded_record_from_datum(),
> expanded_record_set_field_internal(), and
> expanded_record_set_fields(). The first two of these inject an entire
> tuple, while the latter two work on a field-by-field basis. For that
> reason, the latter two are not really problematic. I'm not quite sure
> what the right thing to do is here, but if we wanted to check whether
> a Datum that we're absorbing is non-pglz-compressed in those places,
> it would be easy to do. Also, as far as I can see,
> make_expanded_record_from_datum() is completely unused. So the problem
> case is where expanded_record_set_tuple() is getting called, and
> specifically where it's being called with expand_external = true. Any
> place that it's being called with expand_external = false, there's
> apparently no problem with the result tuple containing external
> datums, so probably non-pglz compressed data is OK there too.
> All of the places that can potentially pass expand_external = true are
> actually passing !estate->atomic, where estate is a PLpgSQL_execstate.
> In other words, I think the case where this happens is when we're in a
> context where the computed value could need to survive across a COMMIT
> or ROLLBACK, like there may be a procedure running (maybe more than
> one, each invoking the next via CALL) but there are no queries in
> progress. We have to expand TOAST references because committing a
> transaction that deleted data means you might not be able to resolve
> the old TOAST pointer any more: even if you use a snapshot that can
> see everything, VACUUM could nuke the deleted rows - or some of them -
> at any time. To avoid trouble we have to un-externalize before any
> COMMIT or ROLLBACK occurs. That can suck for performance because we
> might be fetching a big value that we don't end up using for anything
> - say if the variable isn't used again - but it beats failing.

I agree with most of this, but I don't think the only reason to
un-externalize is just for COMMIT or ROLLBACK.  I mean, using a trigger
function we might insert a RECORD type into another table whose column
is of the ROWTYPE of the table on which we are doing the operation.  See
the example below:

CREATE TABLE t1(a int, b varchar compression lz4);
INSERT INTO t1 select 1, repeat('a', 3000);
CREATE TABLE t2 (x t1 compression pglz);

CREATE OR REPLACE FUNCTION log_last_name_changes()
  RETURNS TRIGGER
  LANGUAGE PLPGSQL
  AS
$$
BEGIN
INSERT INTO t2 select OLD;
RETURN NEW;
END;
$$;

CREATE TRIGGER last_name_changes
  BEFORE UPDATE
  ON t1
  FOR EACH ROW
  EXECUTE PROCEDURE log_last_name_changes();

UPDATE t1 SET a=2;

SELECT pg_column_compression((t2.x).b) FROM t2;
 pg_column_compression
---
 lz4
(1 row)

So basically, in this case, we are not un-externalizing because of
ROLLBACK or COMMIT; instead, we are doing that because we want to
insert it into the new table.  So this is without my patch and with
my patch (v25_0001_Disallow_compressed_data_inside_container_types,
basically without the changes in expandedrecord.c).  Here is the call
stack at the point where this tuple gets flattened.

#0  expanded_record_set_field_internal (erh=0x2bbcfb0, fnumber=2,
newValue=45863112, isnull=false, expand_external=true,
check_constraints=false) at expandedrecord.c:1225
#1  0x009a1899 in ER_get_flat_size (eohptr=0x2bbcfb0) at
expandedrecord.c:713
#2  0x009a0954 in EOH_get_flat_size (eohptr=0x2bbcfb0) at
expandeddatum.c:77
#3  0x0048f61c in heap_compute_data_size (tupleDesc=0x2bd0168,
values=0x2bd02c8, isnull=0x2bd02d0) at heaptuple.c:155
#4  0x004916b3 in heap_form_tuple (tupleDescriptor=0x2bd0168,
values=0x2bd02c8, isnull=0x2bd02d0) at heaptuple.c:1045
#5  0x007296eb in tts_virtual_copy_heap_tuple (slot=0x2bd0280)
at execTuples.c:272
#6  0x00728d0b in ExecCopySlotHeapTuple (slot=0x2bd0280) at
../../../src/include/executor/tuptable.h:456
#7  0x0072a5d8 in tts_buffer_heap_copyslot (dstslot=0x2bd0820,
srcslot=0x2bd0280) at execTuples.c:767
#8  0x00758840 in ExecCopySlot (dstslot=0x2bd0820,
srcslot=0x2bd0280) at ../../../src/include/executor/tuptable.h:480
#9  0x0075c263 in ExecModifyTable (pstate=0x2bcfa08) at
nodeModifyTable.c:2264


> The argument that we need to force decompression in such cases is
> considerably more tenuous. It revolves around the possibility that the
> compression AM itself has been dropped. As long as we have only
> built-in compression methods, which are undroppable, it seems like we
> could potentially just decide to do nothing at all about this issue.

Re: New Table Access Methods for Multi and Single Inserts

2021-02-19 Thread Bharath Rupireddy
On Wed, Feb 17, 2021 at 12:46 PM Bharath Rupireddy
 wrote:
> Hi,
>
> I addressed the following review comments and attaching v3 patch set.
>
> 1) ExecClearTuple happens before ExecCopySlot in heap_multi_insert_v2
> and this allowed us to remove clear_mi_slots flag from
> TableInsertState.
> 2) I retained the flushed variable inside TableInsertState so that the
> callers can know whether the buffered slots have been flushed. If yes,
> the callers can execute after insert row triggers or perform index
> insertions. This is easier than passing the after insert row triggers
> info and index info to new multi insert table am and let it do. This
> way the functionalities can be kept separate i.e. multi insert ams do
> only buffering, decisions on when to flush, insertions and the callers
> will execute triggers or index insertions. And also none of the
> existing table ams are performing these operations within them, so
> this is inline with the current design of the table ams.
> 3) I have kept the single and multi insert API separate. The previous
> suggestion was to have only a single insert API and let the callers
> provide initially whether they want multi or single inserts. One
> problem with that approach is that we have to allow table ams to
> execute the after row triggers or index insertions. That is something
> I personally don't like.
>
> 0001 - new table ams implementation
> 0002 - the new multi table ams used in CREATE TABLE AS and REFRESH
> MATERIALIZED VIEW
> 0003 - the new multi table ams used in COPY
>
> Please review the v3 patch set further.

Below is the performance gain measured for CREATE TABLE AS with the
new multi insert AM proposed in this thread:

case 1 - 2 integer(of 4 bytes each) columns, 3 varchar(8), tuple size
59 bytes, 100mn tuples
on master - 185sec
on master with multi inserts - 121sec, gain - 1.52X

case 2 - 2 bigint(of 8 bytes each) columns, 3 name(of 64 bytes each)
columns, 1 varchar(8), tuple size 241 bytes, 100mn tuples
on master - 367sec
on master with multi inserts - 291sec, gain - 1.26X

case 3 - 2 integer(of 4 bytes each) columns, tuple size 32 bytes, 100mn tuples
on master - 130sec
on master with multi inserts - 105sec, gain - 1.23X

case 4 - 2 bigint(of 8 bytes each) columns, 16 name(of 64 bytes each)
columns, tuple size 1064 bytes, 10mn tuples
on master - 120sec
on master with multi inserts - 115sec, gain - 1.04X
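
For reference, case 3 can be reproduced with something along these lines
(illustrative only, not the exact script used):

-- 2 integer columns, 100mn rows
create table src (a int, b int);
insert into src select g, g from generate_series(1, 100000000) g;

-- the timed statement
create table dst as select * from src;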

With Regards,
Bharath Rupireddy.
EnterpriseDB: http://www.enterprisedb.com




Re: Improvements and additions to COPY progress reporting

2021-02-19 Thread Bharath Rupireddy
On Fri, Feb 19, 2021 at 2:34 AM Tomas Vondra
 wrote:
> > On Mon, 15 Feb 2021 at 17:07, Tomas Vondra
> >  wrote:
> >>
> >> - The blocks in copyfrom.c/copyto.c should be reworked - I don't think
> >> we do this in our codebase.
> >
> > I saw this being used in (re)index progress reporting, that's where I
> > took inspiration from. It has been fixed in the attached version.
> >
>
> Hmmm, good point. I haven't looked at the other places reporting
> progress and I only ever saw this pattern in old code. I kinda dislike
> these blocks, but admittedly that's a rather subjective view. So if other
> similar places do this when reporting progress, this probably should
> too. What's your opinion on this?

Actually in the code base the style of that variable declaration and
usage of pgstat_progress_update_multi_param is a mix. For instance, in
lazy_scan_heap, ReindexRelationConcurrently, the variables are
declared at the start of the function. And in _bt_spools_heapscan,
index_build, validate_index, perform_base_backup, the variables are
declared within a separate block.

IMO, we can have the arrays declared at the start of the functions
i.e. the way it's done in v8-0001, because we can extend them for
reporting some other parameter (maybe in the future).

> >> - I find the "io_target" name misleading, because in some cases it's
> >> actually the *source*.
> >
> > Yes, I was also not quite happy with this, but couldn't find a better
> > one at the point of writing the initial patchset. Would
> > "io_operations", "io_port", "operates_through" or "through" maybe be
> > better?
> >
>
> No idea. Let's see if someone has a better proposal ...

For COPY FROM the name "source_type" and for COPY TO the name
"destination_type" would make sense. To have a combined column name for
both, how about naming that column "io_type"?

With Regards,
Bharath Rupireddy.
EnterpriseDB: http://www.enterprisedb.com




Re: Asynchronous Append on postgres_fdw nodes.

2021-02-19 Thread Etsuro Fujita
On Thu, Feb 18, 2021 at 3:16 PM Kyotaro Horiguchi
 wrote:
> At Thu, 18 Feb 2021 11:51:59 +0900, Etsuro Fujita  
> wrote in
> > I noticed that this doesn’t work for cases where ForeignScans are
> > executed inside functions, and I don’t have any simple solution for
>
> Ah, concurrent fetches in different plan trees?  (For fairness, I
> hadn't noticed that case:p) The same can happen with an extension that
> is called via hooks.

Yeah, consider a plan containing a FunctionScan that invokes a query
like e.g., “SELECT * FROM foreign_table” via SPI.
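
As a made-up illustration (ft1 and ft2 stand for foreign tables on the
same postgres_fdw server):

CREATE FUNCTION f() RETURNS SETOF ft2
LANGUAGE plpgsql AS $$
BEGIN
  RETURN QUERY SELECT * FROM ft2;   -- runs via SPI, in its own plan tree
END;
$$;

-- the ForeignScan on ft1 and the one started inside f() then end up
-- sharing the same connection
SELECT * FROM ft1, f() AS x;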

> > So I’m getting back to what Horiguchi-san proposed for
> > postgres_fdw to handle concurrent fetches from a remote server
> > performed by multiple ForeignScan nodes that use the same connection.
> > As discussed before, we would need to create a scheduler for
> > performing such fetches in a more optimized way to avoid a performance
> > degradation in some cases, but that wouldn’t be easy.
>
> If the "degradation" means degradation caused by repeated creation of
> remote cursors, anyway every node on the same connection creates its
> own cursor named "c" and it is never "re"created in any case.
>
> If the "degradation" means that my patch needs to wait for the
> previous prefetching query to return tuples before sending a new query
> (vacate_connection()), it is just moving the wait from just before
> sending the new query to just before fetching the next round of the
> previous node. The only case where it becomes visible degradation is
> where the tuples in the next round are not wanted by the upper nodes.

The latter.  And yeah, typical cases where the performance degradation
occurs would be queries with LIMIT, as discussed in [1].

I’m not concerned about postgres_fdw modified to process an
in-progress fetch by a ForeignScan before starting a new
asynchronous/synchronous fetch by another ForeignScan using the same
connection.  Actually, that seems pretty reasonable to me, so I’d like
to use that part in your patch in the next version.  My concern is
that postgresIterateForeignScan() was modified to start another
asynchronous fetch from a remote table (if possible) right after doing
fetch_received_data() for the remote table, because aggressive
prefetching like that may increase the probability that ForeignScans
using the same connection conflict with each other, leading to a large
performance degradation.  (Another issue with that would be that the
fsstate->tuples array for the remote table may be enlarged
indefinitely.)

Whether the degradation is acceptable or not would depend on the user,
and needless to say, the smaller degradation would be more acceptable.
So I’ll update the patch using your patch without the
postgresIterateForeignScan() change.

> > In his proposal,
> > postgres_fdw was modified to perform prefetching pretty aggressively,
> > so I mean removing aggressive prefetching.  I think we could add it to
> > postgres_fdw later maybe as the server/table options.

> That was the natural extension from non-aggressive prefetching.

I also suppose that that would improve the performance in some cases.
Let’s leave that for future work.

> However, maybe we can live without that, since if someone needs more
> speed, it is enough to give every remote table a dedicated
> connection.

Yeah, I think so too.

Thanks!

Best regards,
Etsuro Fujita

[1] 
https://www.postgresql.org/message-id/CAPmGK16E1erFV9STg8yokoewY6E-zEJtLzHUJcQx%2B3dyivCT%3DA%40mail.gmail.com




Re: Improvements and additions to COPY progress reporting

2021-02-19 Thread Michael Paquier
On Sat, Feb 20, 2021 at 11:39:22AM +0530, Bharath Rupireddy wrote:
> Actually in the code base the style of that variable declaration and
> usage of pgstat_progress_update_multi_param is a mix. For instance, in
> lazy_scan_heap, ReindexRelationConcurrently, the variables are
> declared at the start of the function. And in _bt_spools_heapscan,
> index_build, validate_index, perform_base_backup, the variables are
> declared within a separate block.

I think that we should encourage the use of
pgstat_progress_update_multi_param() where we can, as it makes
consistent the updates to all the parameters according to
st_changecount.  That's also usually cleaner to store all the
parameters that are changed if these are updated multiple times like
the REINDEX CONCURRENTLY ones.  The context of the code also matters,
of course.
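
For instance, something along these lines (a sketch only; the function
name and the particular progress constants are just examples):

#include "postgres.h"
#include "commands/progress.h"
#include "pgstat.h"

/* hypothetical helper: report two COPY counters in one consistent update */
static void
report_copy_progress(int64 tuples_processed, int64 bytes_processed)
{
    const int   index[] = {
        PROGRESS_COPY_TUPLES_PROCESSED,
        PROGRESS_COPY_BYTES_PROCESSED
    };
    int64       val[2];

    val[0] = tuples_processed;
    val[1] = bytes_processed;

    /* both values become visible atomically w.r.t. st_changecount */
    pgstat_progress_update_multi_param(2, index, val);
}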

Scanning through the patch set, 0002 is a good idea taken
independently.
--
Michael


signature.asc
Description: PGP signature


Re: New Table Access Methods for Multi and Single Inserts

2021-02-19 Thread Bharath Rupireddy
On Sat, Feb 20, 2021 at 12:53 PM Zhihong Yu  wrote:
>
> Hi,
> bq. case 3 - 2 integer(of 4 bytes each) columns, tuple size 32 bytes
>
> Is there some other column(s) per row apart from the integer columns ? Since 
> the 2 integer columns only occupy 8 bytes. I wonder where the other 32-8=24 
> bytes come from.

There are no other columns in the test case. Those 24 bytes are for
tuple header (23 bytes) and 1 byte for other bookkeeping info. See
"Table Row Layout" from
https://www.postgresql.org/docs/devel/storage-page-layout.html.

With Regards,
Bharath Rupireddy.
EnterpriseDB: http://www.enterprisedb.com