Re: [HACKERS] logical decoding of two-phase transactions

2020-09-21 Thread Ajin Cherian
> Why can't we call ReorderBufferCleanupTXN() from
> ReorderBufferFinishPrepared after your changes?
>

Since the truncate already removed the changes, it would fail on the
below Assert in ReorderBufferCleanupTXN()
/* Check we're not mixing changes from different transactions. */
Assert(change->txn == txn);

regards.
Ajin Cherian
Fujitsu Australia




Re: VACUUM PARALLEL option vs. max_parallel_maintenance_workers

2020-09-21 Thread Peter Eisentraut

On 2020-09-21 05:48, Amit Kapila wrote:

What according to you should be the behavior here and how will it be
better than current?


I think if I write VACUUM (PARALLEL 5), it should use up to 5 workers 
(up to the number of indexes), even if max_parallel_maintenance_workers 
is 2.


--
Peter Eisentraut  http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: Yet another fast GiST build

2020-09-21 Thread Heikki Linnakangas

On 21/09/2020 02:06, Tom Lane wrote:

Justin Pryzby  writes:

This also appears to break checksums.


Thanks, I'll go fix it.


I was wondering about that, because the typical pattern for use of
smgrextend for indexes seems to be

RelationOpenSmgr(rel);
PageSetChecksumInplace(page, lastblock);
smgrextend(rel->rd_smgr, MAIN_FORKNUM, lastblock, zerobuf.data, false);

and gist_indexsortbuild wasn't doing either of the first two things.

gist_indexsortbuild_flush_ready_pages looks like it might be
a few bricks shy of a load too.  But my local CLOBBER_CACHE_ALWAYS
run hasn't gotten to anything except the pretty-trivial index
made in point.sql, so I don't have evidence about it.


I don't think a relcache invalidation can happen on the index we're 
building. Other similar callers call RelationOpenSmgr(rel) before every 
write though (e.g. _bt_blwritepage()), so perhaps it's better to copy 
that pattern here too.



Another interesting point is that all the other index AMs seem to WAL-log
the new page before the smgrextend call, whereas this code is doing it
in the other order.  I strongly doubt that both patterns are equally
correct.  Could be that the other AMs are in the wrong though.


My thinking was that it's better to call smgrextend() first, so that if 
you run out of disk space, you get the error before WAL-logging it. That 
reduces the chance that WAL replay will run out of disk space. A lot of 
things are different during WAL replay, so it's quite likely that WAL 
replay runs out of disk space anyway if you're living on the edge, but 
still.


I didn't notice that the other callers are doing it the other way round, 
though. I think they need to, so that they can stamp the page with the 
LSN of the WAL record. But GiST build is special in that regard, because 
it stamps all pages with GistBuildLSN.
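
To make the ordering concrete, what the sorted build can do is roughly this
(sketch only, not the actual gistbuild.c code):

RelationOpenSmgr(rel);
PageSetLSN(page, GistBuildLSN);      /* constant LSN, not a WAL record's LSN */
PageSetChecksumInplace(page, blkno);
smgrextend(rel->rd_smgr, MAIN_FORKNUM, blkno, (char *) page, true);
if (RelationNeedsWAL(rel))
    log_newpage(&rel->rd_node, MAIN_FORKNUM, blkno, page, true);

whereas the other AMs have to WAL-log first, because they need the record's
LSN to stamp on the page before writing it out.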


- Heikki




Re: Yet another fast GiST build

2020-09-21 Thread Heikki Linnakangas

On 21/09/2020 11:08, Heikki Linnakangas wrote:

I think they need to, so that they can stamp the page with the LSN of
the WAL record. But GiST build is special in that regard, because it
stamps all pages with GistBuildLSN.


Actually, don't we have a problem with that, even before this patch? 
Even though we set the LSN to the magic GistBuildLSN value when we build 
the index, WAL replay will write the LSN of the record instead. That 
would mess with the LSN-NSN interlock. After WAL replay (or in a 
streaming replica), a scan on the GiST index might traverse right-links 
unnecessarily.


- Heikki




Re: Binaries on s390x arch

2020-09-21 Thread Christoph Berg
Re: Namrata Bhave
> As seen from downloads page, the Apt repo/rpms are not yet available for
> s390x for latest versions.

Hi,

are you asking about apt (.deb) or yum (.rpm) packages?

> Wanted to know if there is any work going on/planned to provide Postgres in
> ready-to-use package or installer form for s390x architecture?

s390x is fully supported as a Debian release architecture, so you can
get PostgreSQL and all packaged modules there (PG11 on Debian buster).

The apt team doesn't have plans for s390x yet on the PostgreSQL side.
If you are interested, are you in a position to donate a build host?

Christoph




Re: Yet another fast GiST build

2020-09-21 Thread Andrey M. Borodin



> On 21 Sept 2020, at 13:45, Heikki Linnakangas wrote:
> 
> On 21/09/2020 11:08, Heikki Linnakangas wrote:
>> I think they need to, so that they can stamp the page with the LSN of
>> the WAL record. But GiST build is special in that regard, because it
>> stamps all pages with GistBuildLSN.
> 
> Actually, don't we have a problem with that, even before this patch? Even 
> though we set the LSN to the magic GistBuildLSN value when we build the 
> index, WAL replay will write the LSN of the record instead. That would mess 
> with the LSN-NSN interlock. After WAL replay (or in a streaming replica), a 
> scan on the GiST index might traverse right-links unnecessarily.

I think we don't set rightlinks during index build.

Best regards, Andrey Borodin.



Re: Is deduplicate_items ever used in nbtree?

2020-09-21 Thread Julien Rouhaud
Hi,

On Mon, Sep 21, 2020 at 2:21 PM Nikolay Shaplov  wrote:
>
> Hi!
>
> I am still working with reloptions patches, and noticed that in current master
> deduplicate_items option exists, but never actually used.
> You can set it, you can dump it, it has a proper representation in BTOptions,
> it is mentioned in documentation, but it is never actually used in the code.
> Or at least I did not manage to grep it.
>
> Is it the way it should be, or something needs fixing?

It's used via the BTGetDeduplicateItems() macro.
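
From memory, the macro is roughly this (simplified; see
src/include/access/nbtree.h for the real definition):

#define BTGetDeduplicateItems(relation) \
    ((relation)->rd_options ? \
     ((BTOptions *) (relation)->rd_options)->deduplicate_items : true)

so the option name itself only shows up in the reloptions machinery, while the
places that consume it go through the macro.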




Logical replication from PG v13 and below to PG v14 (devel version) is not working.

2020-09-21 Thread Ashutosh Sharma
Hi All,

Today, while exploring logical replication in PostgreSQL, I noticed
that logical replication from PG version 13 and below to PG v14
(development version) is not working. It has stopped working from the
following git commit onwards:

commit 464824323e57dc4b397e8b05854d779908b55304
Author: Amit Kapila 
Date:   Thu Sep 3 07:54:07 2020 +0530

Add support for streaming to built-in logical replication.

...
...

Here is the experiment that I performed to verify this:

Publisher (PG-v12/13):
==
CREATE TABLE pg2pg (id serial PRIMARY KEY, whatever text);

INSERT INTO pg2pg (whatever) SELECT 'str-' || i FROM generate_series(1, 5) i;

SELECT * FROM pg2pg;

CREATE PUBLICATION pg2pg_pub FOR TABLE pg2pg;

Subscriber (PG-v14 HEAD or commit 46482432):
=
CREATE TABLE pg2pg (id serial PRIMARY KEY, whatever text);

CREATE SUBSCRIPTION pg2pg_sub CONNECTION 'host=127.0.0.1 port=5433
dbname=postgres user=ashu' PUBLICATION pg2pg_pub;

SELECT * FROM pg2pg;

The above SELECT query produces no result. When this experiment is
performed on a build from before the mentioned git commit, it works fine.

After spending some time looking into this issue, I observed that the
above git commit has bumped the logical replication protocol version.
Due to this, the logical replication apply worker process is unable to
start WAL streaming, which causes it to terminate. Therefore, the data
inserted into the publication table is not applied on the subscription
table (i.e. no data replication happens).

--
With Regards,
Ashutosh Sharma
EnterpriseDB:http://www.enterprisedb.com




Re: [HACKERS] logical decoding of two-phase transactions

2020-09-21 Thread Amit Kapila
On Mon, Sep 21, 2020 at 12:36 PM Ajin Cherian  wrote:
>
> > Why can't we call ReorderBufferCleanupTXN() from
> > ReorderBufferFinishPrepared after your changes?
> >
>
> Since the truncate already removed the changes, it would fail on the
> below Assert in ReorderBufferCleanupTXN()
> /* Check we're not mixing changes from different transactions. */
> Assert(change->txn == txn);
>

The changes list should be empty by that time because we remove each
change from the list; see the code "dlist_delete(&change->node);" in
ReorderBufferTruncateTXN. If you are hitting the Assert as you
mentioned, then I think the problem is something else.
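
For reference, the relevant loop in ReorderBufferTruncateTXN looks roughly
like this (paraphrased from memory, not copied from the source):

dlist_foreach_modify(iter, &txn->changes)
{
    ReorderBufferChange *change;

    change = dlist_container(ReorderBufferChange, node, iter.cur);

    /* Check we're not mixing changes from different transactions. */
    Assert(change->txn == txn);

    /* unlink the change from txn->changes before returning it to the pool */
    dlist_delete(&change->node);
    ReorderBufferReturnChange(rb, change, true);
}

so after a truncate there should be nothing left in txn->changes for the
cleanup loop to trip over.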

-- 
With Regards,
Amit Kapila.




Re: [HACKERS] logical decoding of two-phase transactions

2020-09-21 Thread Dilip Kumar
On Mon, Sep 21, 2020 at 10:20 AM Amit Kapila  wrote:
>
> On Sun, Sep 20, 2020 at 11:01 AM Dilip Kumar  wrote:
> >
> > On Fri, Sep 18, 2020 at 6:02 PM Ajin Cherian  wrote:
> > >
> >
> > 3.
> >
> > + /*
> > + * If it's ROLLBACK PREPARED then handle it via callbacks.
> > + */
> > + if (TransactionIdIsValid(xid) &&
> > + !SnapBuildXactNeedsSkip(ctx->snapshot_builder, buf->origptr) &&
> > + parsed->dbId == ctx->slot->data.database &&
> > + !FilterByOrigin(ctx, origin_id) &&
> > + ReorderBufferTxnIsPrepared(ctx->reorder, xid, parsed->twophase_gid))
> > + {
> > + ReorderBufferFinishPrepared(ctx->reorder, xid, buf->origptr, buf->endptr,
> > + commit_time, origin_id, origin_lsn,
> > + parsed->twophase_gid, false);
> > + return;
> > + }
> >
> >
> > I think we have already checked !SnapBuildXactNeedsSkip, parsed->dbId
> > == ctx->slot->data.database and !FilterByOrigin in DecodePrepare
> > so if those are not true then we wouldn't have prepared this
> > transaction i.e. ReorderBufferTxnIsPrepared will be false so why do we
> > need
> > to recheck this conditions.
> >
>
> Yeah, probably we should have Assert for below three conditions:
> + !SnapBuildXactNeedsSkip(ctx->snapshot_builder, buf->origptr) &&
> + parsed->dbId == ctx->slot->data.database &&
> + !FilterByOrigin(ctx, origin_id) &&

+1

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com




Re: [HACKERS] logical decoding of two-phase transactions

2020-09-21 Thread Ajin Cherian
On Sun, Sep 20, 2020 at 3:31 PM Dilip Kumar  wrote:

> + /*
> + * If it's ROLLBACK PREPARED then handle it via callbacks.
> + */
> + if (TransactionIdIsValid(xid) &&
> + !SnapBuildXactNeedsSkip(ctx->snapshot_builder, buf->origptr) &&
> + parsed->dbId == ctx->slot->data.database &&
> + !FilterByOrigin(ctx, origin_id) &&
> + ReorderBufferTxnIsPrepared(ctx->reorder, xid, parsed->twophase_gid))
> + {
> + ReorderBufferFinishPrepared(ctx->reorder, xid, buf->origptr, buf->endptr,
> + commit_time, origin_id, origin_lsn,
> + parsed->twophase_gid, false);
> + return;
> + }
>
>
> I think we have already checked !SnapBuildXactNeedsSkip, parsed->dbId
> == ctx->slot->data.database and !FilterByOrigin in DecodePrepare
> so if those are not true then we wouldn't have prepared this
> transaction i.e. ReorderBufferTxnIsPrepared will be false so why do we
> need
> to recheck this conditions.

We could enter DecodeAbort even without a prepare, as the code is
common for both XLOG_XACT_ABORT and XLOG_XACT_ABORT_PREPARED. So, the
conditions !SnapBuildXactNeedsSkip, parsed->dbId ==
ctx->slot->data.database and !FilterByOrigin could be true while the
transaction is not prepared; in that case we don't need to do a
ReorderBufferFinishPrepared (with commit flag false) but call
ReorderBufferAbort. But I think there is a problem: if those conditions
are in fact false, then we should return without trying to abort using
ReorderBufferAbort. What do you think?

I agree with all your other comments.

regards,
Ajin
Fujitsu Australia




Re: VACUUM PARALLEL option vs. max_parallel_maintenance_workers

2020-09-21 Thread Amit Kapila
On Mon, Sep 21, 2020 at 12:45 PM Peter Eisentraut
 wrote:
>
> On 2020-09-21 05:48, Amit Kapila wrote:
> > What according to you should be the behavior here and how will it be
> > better than current?
>
> I think if I write VACUUM (PARALLEL 5), it should use up to 5 workers
> (up to the number of indexes), even if max_parallel_maintenance_workers
> is 2.
>

So you want it to disregard max_parallel_maintenance_workers, but all
parallel operations have to respect one of the max_parallel_* options so
that they stay within max_parallel_workers, beyond which the system won't
allow more parallel workers. Now, if we don't respect one of the
max_parallel_* options, it will unnecessarily try to register that many
workers even though it won't be able to start them. I think it is better
to keep separate worker limits for scans and for maintenance operations,
so that the user is allowed to perform different parallel operations in
the system.

-- 
With Regards,
Amit Kapila.




Re: Logical replication from PG v13 and below to PG v14 (devel version) is not working.

2020-09-21 Thread Bharath Rupireddy
On Mon, Sep 21, 2020 at 3:26 PM Ashutosh Sharma  wrote:
>
> commit 464824323e57dc4b397e8b05854d779908b55304
> Author: Amit Kapila 
> Date:   Thu Sep 3 07:54:07 2020 +0530
>
> Above select query produces no result. When this experiment is
> performed below the mentioned git commit, it works fine.
>

I encountered the same issue. We see the below error during the table
sync phase in the subscriber server logs.

2020-09-21 16:01:45.794 IST [782428] LOG:  logical replication apply
worker for subscription "pg2pg_sub" has started
2020-09-21 16:01:45.799 IST [782428] ERROR:  could not start WAL
streaming: ERROR:  client sent proto_version=2 but we only support
protocol 1 or lower
CONTEXT:  slot "pg2pg_sub", output plugin "pgoutput", in the
startup callback
2020-09-21 16:01:45.800 IST [781607] LOG:  background worker "logical
replication worker" (PID 782428) exited with exit code 1

>
> After spending some time looking into this issue, I observed that
> above git commit has bumped the logical replication protocol version.
> Due to this, the logical replication apply worker process is unable to
> do WAL streaming which causes it to terminate. Therefore, the data
> inserted in the publication table is not applied on the subscription
> table (i.e. no data replication happens)
>

That's because the above commit changed LOGICALREP_PROTO_VERSION_NUM
from 1 to 2.

With Regards,
Bharath Rupireddy.
EnterpriseDB: http://www.enterprisedb.com




Re: Logical replication from PG v13 and below to PG v14 (devel version) is not working.

2020-09-21 Thread Dilip Kumar
On Mon, Sep 21, 2020 at 3:26 PM Ashutosh Sharma  wrote:
>
> Hi All,
>
> Today, while exploring logical replication in PostgreSQL, I noticed
> that logical replication from PG version 13 and below to PG v14
> (development version) is not working. It has stopped working from the
> following git commit onwards:
>
> commit 464824323e57dc4b397e8b05854d779908b55304
> Author: Amit Kapila 
> Date:   Thu Sep 3 07:54:07 2020 +0530
>
> Add support for streaming to built-in logical replication.
>
> ...
> ...
>
> Here is the experiment that I performed to verify this:
>
> Publisher (PG-v12/13):
> ==
> CREATE TABLE pg2pg (id serial PRIMARY KEY, whatever text);
>
> INSERT INTO pg2pg (whatever) SELECT 'str-' || i FROM generate_series(1, 5) i;
>
> SELECT * FROM pg2pg;
>
> CREATE PUBLICATION pg2pg_pub FOR TABLE pg2pg;
>
> Subscriber (PG-v14 HEAD or commit 46482432):
> =
> CREATE TABLE pg2pg (id serial PRIMARY KEY, whatever text);
>
> CREATE SUBSCRIPTION pg2pg_sub CONNECTION 'host=127.0.0.1 port=5433
> dbname=postgres user=ashu' PUBLICATION pg2pg_pub;
>
> SELECT * FROM pg2pg;
>
> Above select query produces no result. When this experiment is
> performed below the mentioned git commit, it works fine.
>
> After spending some time looking into this issue, I observed that
> above git commit has bumped the logical replication protocol version.
> Due to this, the logical replication apply worker process is unable to
> do WAL streaming which causes it to terminate. Therefore, the data
> inserted in the publication table is not applied on the subscription
> table (i.e. no data replication happens)

Seems like this commit should have only set
LOGICALREP_PROTO_STREAM_VERSION_NUM to 2; LOGICALREP_PROTO_VERSION_NUM
shouldn't have been changed.

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com




Re: [POC] Fast COPY FROM command for the table with foreign partitions

2020-09-21 Thread Andrey Lepikhov
This patch currently looks quite ready for use, and I'm taking a close
look at the error reporting. Here we have a difference in behavior between
a local and a foreign table:


regression test in postgres_fdw.sql:
copy rem2 from stdin;
-1  xyzzy
\.

reports error (1):
=
ERROR:  new row for relation "loc2" violates check constraint...
DETAIL:  Failing row contains (-1, xyzzy).
CONTEXT:  COPY loc2, line 1: "-1   xyzzy"
remote SQL command: COPY public.loc2(f1, f2) FROM STDIN
COPY rem2, line 2

But local COPY into loc2 reports another error (2):
===
copy loc2 from stdin;
ERROR:  new row for relation "loc2" violates check constraint...
DETAIL:  Failing row contains (-1, xyzzy).
CONTEXT:  COPY loc2, line 1: "-1   xyzzy"

Report (2) is shorter and more specific.
Report (1) contains meaningless information.

Maybe we need to improve the error report? For example, like this:
ERROR: Failed COPY into foreign table "rem2":
new row for relation "loc2" violates check constraint...
DETAIL:  Failing row contains (-1, xyzzy).
remote SQL command: COPY public.loc2(f1, f2) FROM STDIN
COPY rem2, line 1

The problem here is that we run into the error after the COPY FROM
command completes, and we need to translate the line number reported by the
foreign server into a line number of the overall COPY command.


--
regards,
Andrey Lepikhov
Postgres Professional




Re: [HACKERS] logical decoding of two-phase transactions

2020-09-21 Thread Amit Kapila
On Mon, Sep 21, 2020 at 3:45 PM Ajin Cherian  wrote:
>
> On Sun, Sep 20, 2020 at 3:31 PM Dilip Kumar  wrote:
>
> > + /*
> > + * If it's ROLLBACK PREPARED then handle it via callbacks.
> > + */
> > + if (TransactionIdIsValid(xid) &&
> > + !SnapBuildXactNeedsSkip(ctx->snapshot_builder, buf->origptr) &&
> > + parsed->dbId == ctx->slot->data.database &&
> > + !FilterByOrigin(ctx, origin_id) &&
> > + ReorderBufferTxnIsPrepared(ctx->reorder, xid, parsed->twophase_gid))
> > + {
> > + ReorderBufferFinishPrepared(ctx->reorder, xid, buf->origptr, buf->endptr,
> > + commit_time, origin_id, origin_lsn,
> > + parsed->twophase_gid, false);
> > + return;
> > + }
> >
> >
> > I think we have already checked !SnapBuildXactNeedsSkip, parsed->dbId
> > == ctx->slot->data.database and !FilterByOrigin in DecodePrepare
> > so if those are not true then we wouldn't have prepared this
> > transaction i.e. ReorderBufferTxnIsPrepared will be false so why do we
> > need
> > to recheck this conditions.
>
> We could enter DecodeAbort even without a prepare, as the code is
> common for both XLOG_XACT_ABORT and XLOG_XACT_ABORT_PREPARED. So, the
> conditions !SnapBuildXactNeedsSkip, parsed->dbId ==
> ctx->slot->data.database and !FilterByOrigin could be true while the
> transaction is not prepared; in that case we don't need to do a
> ReorderBufferFinishPrepared (with commit flag false) but call
> ReorderBufferAbort. But I think there is a problem: if those conditions
> are in fact false, then we should return without trying to abort using
> ReorderBufferAbort. What do you think?
>

I think we need to call ReorderBufferAbort at least to clean up the
TXN. Also, if what you are saying is correct then that should be true
without this patch as well, no? If so, we don't need to worry about it
as far as this patch is concerned.

-- 
With Regards,
Amit Kapila.




Re: pg_service.conf file with unknown parameters

2020-09-21 Thread Daniel Gustafsson
> On 11 Sep 2020, at 14:39, Magnus Hagander  wrote:

> For example, if I have a service file with gssencmode=disable set, that 
> service file cannot be used by a psql client linked against libpq from 
> version 10. Even if the behavior would be identical (since v10 doesn't have 
> gssencmode).
> 
> Is there a particular reason we (1) refuse unknown parameters in the file,

The above case is a good example of when silently ignoring would be
beneficial.  We would however run the risk that someone has this in their
service file, where it is silently ignored, and fails to notice:

  ssl_mim_protocol_version=TLSv1.3

Not sure if that's worth the risk?  Halting on unknown parameters is also
consistent with postgresql.conf parsing etc (which isn't a clientside file I
know, but still relevant IMO).

> and (2) call it a "syntax error"?

I agree that isn't entirely correct: the syntax is fine, but the
parameter is unknown. Something along the following lines seems more correct:

  -  libpq_gettext("syntax error in service file \"%s\", line %d\n"),
  +  libpq_gettext("unknown parameter in service file \"%s\", line %d\n"),

> The documentation just says it's "INI format" file -- though in my experience 
> most other INI file parsers just ignore extra parameters included..

I don't think the INI file format is formally defined anywhere, but I do
believe it's considered to be strictly key-values (Wikipedia [0] agrees with
that).  Since we allow ldap configurations like the one below, it's technically
not INI format though:

  [service=mydb]
  ldap://127.0.0.1:1/foo?bar?lorem?ipsum

That might be borderline hairsplitting, but perhaps calling it INI format in
the docs isn't really helping?  Maybe we should reword that to say key/value or
something similar?

And this brings up an even more interesting case: the above will yield a syntax
error if libpq wasn't compiled with LDAP support, as opposed to other
parameters (like SSL* etc.) which are ignored by builds not supporting them.
Is there a reason to treat ldap differently?

cheers ./daniel

[0] https://en.wikipedia.org/wiki/INI_file



Re: Planner, check if can use consider HASH for groupings (src/backend/optimizer/plan/planner.c)

2020-09-21 Thread Ranier Vilela
On Sun, Sep 20, 2020 at 21:10, Tomas Vondra <tomas.von...@2ndquadrant.com> wrote:

> On Sun, Sep 20, 2020 at 08:09:56PM -0300, Ranier Vilela wrote:
> >On Fri, Sep 18, 2020 at 10:37, Tomas Vondra <tomas.von...@2ndquadrant.com> wrote:
> >
> >> On Thu, Sep 17, 2020 at 06:31:12PM -0300, Ranier Vilela wrote:
> >> >On Thu, Sep 17, 2020 at 17:09, Tomas Vondra <tomas.von...@2ndquadrant.com> wrote:
> >> >
> >> >> On Thu, Sep 17, 2020 at 02:12:12PM -0300, Ranier Vilela wrote:
> >> >> >Hi,
> >> >> >
> >> >> >In case gd->any_hashable is FALSE, grouping_is_hashable is never
> >> called.
> >> >> >In this case, the planner could use HASH for groupings, but will
> never
> >> >> know.
> >> >> >
> >> >>
> >> >> The condition is pretty simple - if the query has grouping sets,
> look at
> >> >> grouping sets, otherwise look at groupClause. For this to be an issue
> >> >> the query would need to have both grouping sets and (independent)
> group
> >> >> clause - which AFAIK is not possible.
> >> >>
> >> >Uh?
> >> >(parse->groupClause != NIL) If different from NIL we have
> ((independent)
> >> >group clause), grouping_is_hashable should check?
> >> >(gd ? gd->any_hashable : grouping_is_hashable(parse->groupClause
> >> >If gd is not NIL and gd->any_hashtable is FALSE?
> >> >
> >>
> >> Sorry, I'm not quite sure I understand what this is meant to say :-(
> >>
> >> Anyway, (groupClause != NIL) does not mean the groupClause is somehow
> >> independent (of what?). Add some debugging to create_grouping_paths and
> >> you'll see that e.g. this query ends up with groupClause with 3 items:
> >>
> >>  select 1 from t group by grouping sets ((a), (b), (c));
> >>
> >> and so does this one:
> >>
> >>  select 1 from t group by grouping sets ((a,c), (a,b));
> >>
> >> I'm no expert in grouping sets, but I'd bet this means we transform the
> >> grouping sets into a groupClause by extracting the keys. I haven't
> >> investigated why exactly we do this, but I'm sure there's a reason (e.g.
> >> it gives us SortGroupClause).
> >>
> >> You seem to believe a query can have both grouping sets and regular
> >> grouping at the same level - but how would such query look like? Surely
> >> you can't have two GROuP BY clauses. You can do
> >>
> >Translating into code:
> >gd is grouping sets and
> >parse->groupClause is regular grouping
> >and we cannot have both at the same time.
> >
>
> Have you verified the claim that we can't have both at the same time? As
> I already explained, we build groupClause from grouping sets. I suggest
> you do some debugging using the queries I used as examples, and you'll
> see the claim is not true.
>
I think we already agreed that we cannot have both at the same time.


>
> >
> >>  select 1 from t group by a, grouping sets ((b), (c));
> >>
> >> which is however mostly equivalent to (AFAICS)
> >>
> >>  select 1 from t group by grouping sets ((a,b), (a,c))
> >>
> >> so it's not like we have an independent groupClause either I think.
> >>
> >> The whole point of the original condition is - if there are grouping
> >> sets, check if at least one can be executed using hashing (i.e. all keys
> >> are hashable). Otherwise (without grouping sets) look at the grouping as
> >> a whole.
> >>
> >Or if gd is NULL check  parse->groupClause.
> >
>
> Which is what I'm saying.
>
> >
> >> I don't see how your change improves this - if there are grouping sets,
> >> it's futile to look at the whole groupClause if at least one grouping
> >> set can't be hashed.
> >>
> >> But there's a simple way to disprove this - show us a case (table and a
> >> query) where your code correctly allows hashing while the current one
> >> does not.
> >
> >I'm not an expert in grouping either.
> >The question I have here is whether gd is populated and has gd->
> >any_hashable as FALSE,
> >That means there is no use checking parse->groupClause; it's a waste of time, OK.
> >
>
> I'm sorry, I don't follow your logic. Those are two separate cases. If
> we have grouping sets, we have to check if at least one can be hashed.
> If we don't have grouping sets, we have to check groupClause directly.
> Why would that be a waste of time is unclear to me.
>
This is clear to me.
The problem is how it was implemented in create_grouping_paths.


>
> >
> >>
> >> >
> >> >> For hashing to be worth considering, at least one grouping set has
> to be
> >> >> hashable - otherwise it's pointless.
> >> >>
> >> >> Granted, you could have something like
> >> >>
> >> >>  GROUP BY GROUPING SETS ((a), (b)), c
> >> >>
> >> >> which I think essentially says "add c to every grouping set" and that
> >> >> will be covered by the any_hashable check.
> >> >>
> >> >I am not going into the merit of whether or not it is worth hashing.
> >> >grouping_is_hashable as a last resort, must decide.
> >> >
> >>
> >> I don't know what this is supposed to say either. The whole point of
> >> this check is to simply skip construction of hash-based paths in cases

Re: [HACKERS] logical decoding of two-phase transactions

2020-09-21 Thread Ajin Cherian
On Mon, Sep 21, 2020 at 9:24 PM Amit Kapila  wrote:

> I think we need to call ReorderBufferAbort at least to clean up the
> TXN. Also, if what you are saying is correct then that should be true
> without this patch as well, no? If so, we don't need to worry about it
> as far as this patch is concerned.

Yes, that is true. So I will change this check to:

if (TransactionIdIsValid(xid) &&
    ReorderBufferTxnIsPrepared(ctx->reorder, xid, parsed->twophase_gid))

regards,
Ajin Cherian
Fujitsu Australia




Re: make MaxBackends available in _PG_init

2020-09-21 Thread Bharath Rupireddy
On Mon, Sep 21, 2020 at 9:14 AM Wang, Shenhao
 wrote:
>
> In source, I find that the postmaster will first load library, and then 
> calculate the value of MaxBackends.
>
> In the old version, the MaxBackends was calculated by:
>  MaxBackends = MaxConnections + autovacuum_max_workers + 1 +
> GetNumShmemAttachedBgworkers();
> Because any extension can register workers which will affect the return value 
> of GetNumShmemAttachedBgworkers.
> InitializeMaxBackends must be called after shared_preload_libraries. This is 
> also mentioned in comments.
>
> Now, function GetNumShmemAttachedBgworkers was deleted and replaced by guc 
> max_worker_processes,
> so if we changed the calling order like:
> Step1: calling InitializeMaxBackends.
> Step2: calling process_shared_preload_libraries
>

Yes, GetNumShmemAttachedBgworkers() was removed by commit
dfbba2c86cc8f09cf3ffca3d305b4ce54a7fb49a. AFAICS, changing the order
of InitializeMaxBackends() and process_shared_preload_libraries() is
not a problem, as InitializeMaxBackends() doesn't calculate
MaxBackends based on the bgworker infrastructure anymore; it
calculates it from GUCs.
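
For reference, the calculation is now roughly this (from memory, GUC
values only):

MaxBackends = MaxConnections + autovacuum_max_workers + 1 +
              max_worker_processes + max_wal_senders;

so nothing in it depends on what the preloaded libraries registered.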

Having said that, I'm not quite sure whether any of the bgworker
registration code, or for that matter the process_shared_preload_libraries()
code path, uses MaxBackends somewhere.

>
> In this order extension can get the correct value of MaxBackends in _PG_init.
>

Is there any specific use case where _PG_init() would use MaxBackends?

I think the InitializeMaxBackends() function comment still says the following:

 * This must be called after modules have had the chance to register background
 * workers in shared_preload_libraries, and before shared memory size is
 * determined.

What happens with your patch in EXEC_BACKEND cases? (On Linux,
EXEC_BACKEND (see include/pg_config_manual.h) can be enabled for testing
purposes.)

With Regards,
Bharath Rupireddy.
EnterpriseDB: http://www.enterprisedb.com




Re: [HACKERS] logical decoding of two-phase transactions

2020-09-21 Thread Amit Kapila
On Mon, Sep 21, 2020 at 5:23 PM Ajin Cherian  wrote:
>
> On Mon, Sep 21, 2020 at 9:24 PM Amit Kapila  wrote:
>
> > I think we need to call ReorderBufferAbort at least to clean up the
> > TXN. Also, if what you are saying is correct then that should be true
> > without this patch as well, no? If so, we don't need to worry about it
> > as far as this patch is concerned.
>
> Yes, that is true. So I will change this check to:
>
> if (TransactionIdIsValid(xid) &&
>     ReorderBufferTxnIsPrepared(ctx->reorder, xid, parsed->twophase_gid))
>

Yeah and add the Assert for skip conditions as asked above.

-- 
With Regards,
Amit Kapila.




Lift line-length limit for pg_service.conf

2020-09-21 Thread Daniel Gustafsson
The pg_service.conf parsing thread [0] made me realize that we have a hardwired
maximum line length of 256 bytes.  Lifting this would be in line with recent work
for ecpg, pg_regress and pg_hba (784b1ba1a2 and 8f8154a50).  The attached patch
moves pg_service.conf parsing to the new pg_get_line_append API and a StringInfo
to lift the restriction.  Any reason not to do that while we're lifting other
such limits?

cheers ./daniel



0001-Remove-arbitrary-line-length-in-pg_service.conf-pars.patch
Description: Binary data


[0] CABUevEw_RRRgG2uwsO7CD0TQf+Z=oR=S1=QyifV8D_5hatJ=o...@mail.gmail.com

Re: Yet another fast GiST build

2020-09-21 Thread Heikki Linnakangas

On 21/09/2020 02:06, Tom Lane wrote:

Justin Pryzby  writes:

This also appears to break checksums.


Fixed, thanks for the report!


I was wondering about that, because the typical pattern for use of
smgrextend for indexes seems to be

RelationOpenSmgr(rel);
PageSetChecksumInplace(page, lastblock);
smgrextend(rel->rd_smgr, MAIN_FORKNUM, lastblock, zerobuf.data, false);

and gist_indexsortbuild wasn't doing either of the first two things.

gist_indexsortbuild_flush_ready_pages looks like it might be
a few bricks shy of a load too.  But my local CLOBBER_CACHE_ALWAYS
run hasn't gotten to anything except the pretty-trivial index
made in point.sql, so I don't have evidence about it.


I added a RelationOpenSmgr() call there too, although it's not needed 
currently. It seems to be enough to do it before the first smgrextend() 
call. But if you removed or refactored the first call somehow, so that it
was not the first call anymore, it would be easy to miss that you'd
still need the RelationOpenSmgr() call there. It's more consistent with 
the code in nbtsort.c now, too.


- Heikki




Re: Logical replication from PG v13 and below to PG v14 (devel version) is not working.

2020-09-21 Thread Dilip Kumar
On Mon, Sep 21, 2020 at 4:15 PM Dilip Kumar  wrote:
>
> On Mon, Sep 21, 2020 at 3:26 PM Ashutosh Sharma  wrote:
> >
> > Hi All,
> >
> > Today, while exploring logical replication in PostgreSQL, I noticed
> > that logical replication from PG version 13 and below to PG v14
> > (development version) is not working. It has stopped working from the
> > following git commit onwards:
> >
> > commit 464824323e57dc4b397e8b05854d779908b55304
> > Author: Amit Kapila 
> > Date:   Thu Sep 3 07:54:07 2020 +0530
> >
> > Add support for streaming to built-in logical replication.
> >
> > ...
> > ...
> >
> > Here is the experiment that I performed to verify this:
> >
> > Publisher (PG-v12/13):
> > ==
> > CREATE TABLE pg2pg (id serial PRIMARY KEY, whatever text);
> >
> > INSERT INTO pg2pg (whatever) SELECT 'str-' || i FROM generate_series(1, 5) 
> > i;
> >
> > SELECT * FROM pg2pg;
> >
> > CREATE PUBLICATION pg2pg_pub FOR TABLE pg2pg;
> >
> > Subscriber (PG-v14 HEAD or commit 46482432):
> > =
> > CREATE TABLE pg2pg (id serial PRIMARY KEY, whatever text);
> >
> > CREATE SUBSCRIPTION pg2pg_sub CONNECTION 'host=127.0.0.1 port=5433
> > dbname=postgres user=ashu' PUBLICATION pg2pg_pub;
> >
> > SELECT * FROM pg2pg;
> >
> > Above select query produces no result. When this experiment is
> > performed below the mentioned git commit, it works fine.
> >
> > After spending some time looking into this issue, I observed that
> > above git commit has bumped the logical replication protocol version.
> > Due to this, the logical replication apply worker process is unable to
> > do WAL streaming which causes it to terminate. Therefore, the data
> > inserted in the publication table is not applied on the subscription
> > table (i.e. no data replication happens)
>
> Seems like this commit, should have only set the
> LOGICALREP_PROTO_STREAM_VERSION_NUM to 2 but the
> LOGICALREP_PROTO_VERSION_NUM shouln't have been changed.

I think if we don't increase the LOGICALREP_PROTO_VERSION_NUM, then
streaming will not work in the latest version.  So what I feel is that
we can keep LOGICALREP_PROTO_VERSION_NUM as 1 and only add one more
parameter, say LOGICALREP_PROTO_MAX_VERSION_NUM.  Then the latest
subscriber can send LOGICALREP_PROTO_STREAM_VERSION_NUM if streaming is
enabled, and LOGICALREP_PROTO_VERSION_NUM otherwise.  And on the
publisher side we can check against the max and min protocol versions.
I have attached a patch for the changes I have explained.
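
In outline, the publisher-side check would become something like this
(just a sketch; LOGICALREP_PROTO_MIN_VERSION_NUM stands for whatever we
name the minimum, and the attached patch has the actual change):

if (data->protocol_version > LOGICALREP_PROTO_MAX_VERSION_NUM ||
    data->protocol_version < LOGICALREP_PROTO_MIN_VERSION_NUM)
    ereport(ERROR,
            (errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
             errmsg("client sent proto_version=%d but we only support protocols %d to %d",
                    data->protocol_version,
                    LOGICALREP_PROTO_MIN_VERSION_NUM,
                    LOGICALREP_PROTO_MAX_VERSION_NUM)));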

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com


v1-0001-Bugfix-in-logical-protocol-version.patch
Description: Binary data


Re: Yet another fast GiST build

2020-09-21 Thread Heikki Linnakangas

On 21/09/2020 12:06, Andrey M. Borodin wrote:

On 21 Sept 2020, at 13:45, Heikki Linnakangas wrote:

Actually, don't we have a problem with that, even before this
patch? Even though we set the LSN to the magic GistBuildLSN value
when we build the index, WAL replay will write the LSN of the
record instead. That would mess with the LSN-NSN interlock. After
WAL replay (or in a streaming replica), a scan on the GiST index
might traverse right-links unnecessarily.


I think we don't set rightlinks during index build.


The new GiST sorting code does not, but the regular insert-based code does.

That's a bit questionable in the new code actually. Was that a conscious
decision? The right-links are only needed when there are concurrent page
splits, so I think it's OK, but the checks for InvalidBlockNumber in
gistScanPage() and gistFindPage() have comment "/* sanity check */".
Comment changes are needed, at least.

- Heikki




Re: PATCH: Batch/pipelining support for libpq

2020-09-21 Thread Dave Cramer
Alvaro,


On Fri, 4 Sep 2020 at 17:26, Alvaro Herrera 
wrote:

> On 2020-Aug-31, Matthieu Garrigues wrote:
>
> > It seems like this patch is nearly finished. I fixed all the remaining
> > issues. I'm also asking a confirmation of the test scenarios you want
> > to see in the next version of the patch.
>
> Hmm, apparently nobody noticed that this patch is not registered in the
> current commitfest.
>
> Since it was submitted ahead of the deadline, I'm going to add it there.
> (The patch has also undergone a lot of review already; it's certainly
> not a newcomer.)
>

I am looking for this in the commitfest and can't find it. However there is
an old commitfest entry

https://commitfest.postgresql.org/13/1024/

Do you have the link for the new one ?

Dave Cramer
www.postgres.rocks


>
>
>


Re: Logical replication from PG v13 and below to PG v14 (devel version) is not working.

2020-09-21 Thread Ashutosh Sharma
Thanks Dilip for the patch. AFAIU, the fix looks good. One small comment:

In the error message we are still referring to the native protocol
version number. Shouldn't it be replaced with the greatest protocol
version number we support now (i.e. LOGICALREP_PROTO_MAX_VERSION_NUM)?

-   if (data->protocol_version > LOGICALREP_PROTO_VERSION_NUM)
+   if (data->protocol_version > LOGICALREP_PROTO_MAX_VERSION_NUM)
ereport(ERROR,
(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
 errmsg("client sent proto_version=%d but we only
support protocol %d or lower",
data->protocol_version, LOGICALREP_PROTO_VERSION_NUM)));

Other than this, I don't have any comments.

On Mon, Sep 21, 2020 at 5:34 PM Dilip Kumar  wrote:
>
> On Mon, Sep 21, 2020 at 4:15 PM Dilip Kumar  wrote:
> >
> > On Mon, Sep 21, 2020 at 3:26 PM Ashutosh Sharma  
> > wrote:
> > >
> > > Hi All,
> > >
> > > Today, while exploring logical replication in PostgreSQL, I noticed
> > > that logical replication from PG version 13 and below to PG v14
> > > (development version) is not working. It has stopped working from the
> > > following git commit onwards:
> > >
> > > commit 464824323e57dc4b397e8b05854d779908b55304
> > > Author: Amit Kapila 
> > > Date:   Thu Sep 3 07:54:07 2020 +0530
> > >
> > > Add support for streaming to built-in logical replication.
> > >
> > > ...
> > > ...
> > >
> > > Here is the experiment that I performed to verify this:
> > >
> > > Publisher (PG-v12/13):
> > > ==
> > > CREATE TABLE pg2pg (id serial PRIMARY KEY, whatever text);
> > >
> > > INSERT INTO pg2pg (whatever) SELECT 'str-' || i FROM generate_series(1, 
> > > 5) i;
> > >
> > > SELECT * FROM pg2pg;
> > >
> > > CREATE PUBLICATION pg2pg_pub FOR TABLE pg2pg;
> > >
> > > Subscriber (PG-v14 HEAD or commit 46482432):
> > > =
> > > CREATE TABLE pg2pg (id serial PRIMARY KEY, whatever text);
> > >
> > > CREATE SUBSCRIPTION pg2pg_sub CONNECTION 'host=127.0.0.1 port=5433
> > > dbname=postgres user=ashu' PUBLICATION pg2pg_pub;
> > >
> > > SELECT * FROM pg2pg;
> > >
> > > Above select query produces no result. When this experiment is
> > > performed below the mentioned git commit, it works fine.
> > >
> > > After spending some time looking into this issue, I observed that
> > > above git commit has bumped the logical replication protocol version.
> > > Due to this, the logical replication apply worker process is unable to
> > > do WAL streaming which causes it to terminate. Therefore, the data
> > > inserted in the publication table is not applied on the subscription
> > > table (i.e. no data replication happens)
> >
> > Seems like this commit, should have only set the
> > LOGICALREP_PROTO_STREAM_VERSION_NUM to 2 but the
> > LOGICALREP_PROTO_VERSION_NUM shouln't have been changed.
>
> I think if we don't increase the LOGICALREP_PROTO_VERSION_NUM, then
> streaming will not work in the latest version.  So what I feel is that
> we can keep the LOGICALREP_PROTO_VERSION_NUM as 1 only add one more
> parameter say LOGICALREP_PROTO_MAX_VERSION_NUM.  So now from the
> latest subscriber if streaming is enabled then we can send
> LOGICALREP_PROTO_STREAM_VERSION_NUM and LOGICALREP_PROTO_VERSION_NUM
> otherwise.  And on publisher side we can check with the
> max_protocol_version and min_protocol version.  I have attached a
> patch for the changes I have explained.
>
> --
> Regards,
> Dilip Kumar
> EnterpriseDB: http://www.enterprisedb.com

-- 
With Regards,
Ashutosh Sharma
EnterpriseDB:http://www.enterprisedb.com




Re: Dynamic gathering the values for seq_page_cost/xxx_cost

2020-09-21 Thread Ashutosh Bapat
On Mon, Sep 21, 2020 at 9:11 AM Andy Fan  wrote:
>
> Here are some changes for my detection program.
>
> |                                                  | seq_read_lat (us) | random_read_lat (us) |
> | FIO                                              |                12 |                  148 |
> | Previous main.c                                  |               8.5 |                   74 |
> | invalidate_device_cache before each testing      |                 9 |                  150 |
> | prepare the test data file with O_DIRECT option  |                15 |                  150 |
>
> In invalidate_device_cache, I just create another 1GB data file and read
> it. (see invalidate_device_cache function) this is similar as the previous 
> fio setup.
>
> prepare test data file with O_DIRECT option means in the past, I prepare the 
> test
> file with buffer IO. and before testing, I do invalidate device cache, file
> system cache. but the buffered prepared file still get better performance, I
> have no idea of it. Since I don't want any cache.  I use O_DIRECT
> option at last. The seq_read_lat changed from 9us to 15us.
> I still can't find out the 25% difference with the FIO result. (12 us vs 9 
> us).
>
> At last, the random_page_cost happens to not change very much.
>
> /u/y/g/fdirect> sudo ./main
> fs_cache_lat = 0.569031us, seq_read_lat = 18.901749us, random_page_lat = 
> 148.650589us
>
> cache hit ratio: 1.00 random_page_cost 1.00
> cache hit ratio: 0.90 random_page_cost 6.401019
> cache hit ratio: 0.50 random_page_cost 7.663772
> cache hit ratio: 0.10 random_page_cost 7.841498
> cache hit ratio: 0.00 random_page_cost 7.864383
>
> This result looks much different from "we should use 1.1 ~ 1.5 for SSD".
>

Very interesting. Thanks for working on this. In an earlier email you
mentioned that TPCH plans changed to efficient ones when you changed
random_page_cost to 8.6 from 4 and seq_page_cost was set to 1. IIUC,
setting the ratio of random_page_cost to seq_page_cost to the same ratio
as that between the corresponding latencies improved the plans. How about
trying this with that ratio set to the one obtained from the latencies
provided by FIO? Do we see any better plans?

Page cost is one thing, but there are also CPU costs involved, e.g. in the
cost of expression evaluation. Should those be changed according to the
CPU latency?

-- 
Best Wishes,
Ashutosh Bapat




Re: PATCH: Batch/pipelining support for libpq

2020-09-21 Thread Dave Cramer
Dave Cramer
www.postgres.rocks


On Mon, 31 Aug 2020 at 11:46, Matthieu Garrigues <
matthieu.garrig...@gmail.com> wrote:

> Hi,
>
> It seems like this patch is nearly finished. I fixed all the remaining
> issues. I'm also asking
> a confirmation of the test scenarios you want to see in the next
> version of the patch.
>
> > Hi,
> >
> > On 2020-07-10 19:01:49 -0400, Alvaro Herrera wrote:
> > > Totally unasked for, here's a rebase of this patch series.  I didn't do
> > > anything other than rebasing to current master, solving a couple of
> very
> > > trivial conflicts, fixing some whitespace complaints by git apply, and
> > > running tests to verify everthing works.
> > >
> > > I don't foresee working on this at all, so if anyone is interested in
> > > seeing this feature in, I encourage them to read and address
> > > Horiguchi-san's feedback.
> >
> > Nor am I planning to do so, but I do think its a pretty important
> > improvement.
> >
> >
> Fixed
>
> >
> >
> > > +/*
> > > + * PQrecyclePipelinedCommand
> > > + * Push a command queue entry onto the freelist. It must be a
> dangling entry
> > > + * with null next pointer and not referenced by any other entry's
> next pointer.
> > > + */
> >
> > Dangling sounds a bit like it's already freed.
> >
> >
> Fixed
>
> >
> >
> > > +/*
> > > + * PQbatchSendQueue
> > > + * End a batch submission by sending a protocol sync. The connection
> will
> > > + * remain in batch mode and unavailable for new synchronous command
> execution
> > > + * functions until all results from the batch are processed by the
> client.
> >
> > I feel like the reference to the protocol sync is a bit too low level
> > for an external API. It should first document what the function does
> > from a user's POV.
> >
> > I think it'd also be good to document whether / whether not queries can
> > already have been sent before PQbatchSendQueue is called or not.
> >
> Fixed
>
> >
> >
> >
> > > + if (conn->batch_status == PQBATCH_MODE_ABORTED && conn->queryclass
> != PGQUERY_SYNC)
> > > + {
> > > + /*
> > > + * In an aborted batch we don't get anything from the server for each
> > > + * result; we're just discarding input until we get to the next sync
> > > + * from the server. The client needs to know its queries got aborted
> > > + * so we create a fake PGresult to return immediately from
> > > + * PQgetResult.
> > > + */
> > > + conn->result = PQmakeEmptyPGresult(conn,
> > > +   PGRES_BATCH_ABORTED);
> > > + if (!conn->result)
> > > + {
> > > + printfPQExpBuffer(&conn->errorMessage,
> > > +  libpq_gettext("out of memory"));
> > > + pqSaveErrorResult(conn);
> > > + return 0;
> >
> > Is there any way an application can recover at this point? ISTM we'd be
> > stuck in the previous asyncStatus, no?
> >
>
> conn->result is null when malloc(sizeof(PGresult)) returns null. It's
> very unlikely unless
> the server machine is out of memory, so the server will probably be
> unresponsive anyway.
>
> I'm leaving this as it is but if anyone has a solution simple to
> implement I'll fix it.
>
> >
> >
> > > +/* pqBatchFlush
> > > + * In batch mode, data will be flushed only when the out buffer
> reaches the threshold value.
> > > + * In non-batch mode, data will be flushed all the time.
> > > + */
> > > +static int
> > > +pqBatchFlush(PGconn *conn)
> > > +{
> > > + if ((conn->batch_status == PQBATCH_MODE_OFF)||(conn->outCount >=
> OUTBUFFER_THRESHOLD))
> > > + return(pqFlush(conn));
> > > + return 0; /* Just to keep compiler quiet */
> > > +}
> >
> > unnecessarily long line.
> >
> Fixed
>
> >
> > > +/*
> > > + * Connection's outbuffer threshold is set to 64k as it is safe
> > > + * in Windows as per comments in pqSendSome() API.
> > > + */
> > > +#define OUTBUFFER_THRESHOLD 65536
> >
> > I don't think the comment explains much. It's fine to send more than 64k
> > with pqSendSome(), they'll just be send with separate pgsecure_write()
> > invocations. And only on windows.
> >
> > It clearly makes sense to start sending out data at a certain
> > granularity to avoid needing unnecessary amounts of memory, and to make
> > more efficient use of latency / serer side compute.
> >
> > It's not implausible that 64k is the right amount for that, I just don't
> > think the explanation above is good.
> >
>
> Fixed
>
> > > diff --git a/src/test/modules/test_libpq/testlibpqbatch.c
> b/src/test/modules/test_libpq/testlibpqbatch.c
> > > new file mode 100644
> > > index 00..4d6ba266e5
> > > --- /dev/null
> > > +++ b/src/test/modules/test_libpq/testlibpqbatch.c
> > > @@ -0,0 +1,1456 @@
> > > +/*
> > > + * src/test/modules/test_libpq/testlibpqbatch.c
> > > + *
> > > + *
> > > + * testlibpqbatch.c
> > > + * Test of batch execution functionality
> > > + */
> > > +
> > > +#ifdef WIN32
> > > +#include 
> > > +#endif
> >
> > ISTM that this shouldn't be needed in a test program like this?
> > Shouldn't libpq abstract all of this away?
> >
>
> Fixed.
>
> >
> > > +static void
> > > +simple_batch(PGconn *conn)
> > > +{
> >
> > ISTM that

Re: Logical replication from PG v13 and below to PG v14 (devel version) is not working.

2020-09-21 Thread Dilip Kumar
On Mon, Sep 21, 2020 at 6:27 PM Ashutosh Sharma  wrote:
>
> Thanks Dilip for the patch. AFAIU, the fix looks good. One small comment:
>
> In the error message we are still referring to the native protocol
> version number. Shouldn't it be replaced with the greatest protocol
> version number we support now (i.e. LOGICALREP_PROTO_MAX_VERSION_NUM)?
>
> -   if (data->protocol_version > LOGICALREP_PROTO_VERSION_NUM)
> +   if (data->protocol_version > LOGICALREP_PROTO_MAX_VERSION_NUM)
> ereport(ERROR,
> (errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
>  errmsg("client sent proto_version=%d but we only
> support protocol %d or lower",
> data->protocol_version, LOGICALREP_PROTO_VERSION_NUM)));
>
> Other than this, I don't have any comments.

Thanks for the review.  I have fixed it the attached patch.

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com


v2-0001-Bugfix-in-logical-protocol-version.patch
Description: Binary data


Re: PATCH: Batch/pipelining support for libpq

2020-09-21 Thread Alvaro Herrera
On 2020-Sep-21, Dave Cramer wrote:

Hello Dave,

> I am looking for this in the commitfest and can't find it. However there is
> an old commitfest entry
> 
> https://commitfest.postgresql.org/13/1024/
> 
> Do you have the link for the new one ?

Here you go:

https://commitfest.postgresql.org/29/2724/

-- 
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: PATCH: Batch/pipelining support for libpq

2020-09-21 Thread Matthieu Garrigues
Matthieu Garrigues

On Mon, Sep 21, 2020 at 3:09 PM Dave Cramer  wrote:
>>
> There was a comment upthread a while back that people should look at the 
> comments made in 
> https://www.postgresql.org/message-id/20180322.211148.187821341.horiguchi.kyotaro%40lab.ntt.co.jp
>  by Horiguchi-San.
>
> From what I can tell this has not been addressed. The one big thing is the 
> use of PQbatchProcessQueue vs just putting it in PQgetResult.
>
> The argument is that adding PQbatchProcessQueue is unnecessary and just adds 
> another step. Looking at this, it seems like putting this inside PQgetResult 
> would get my vote as it leaves the interface unchanged.
>

Ok. I'll merge PQbatchProcessQueue into PQgetResult. But just one
thing: I'll keep PQgetResult returning NULL between the results of two
batched queries so the user can know which result comes from which
query.




Re: Logical replication from PG v13 and below to PG v14 (devel version) is not working.

2020-09-21 Thread Amit Kapila
On Mon, Sep 21, 2020 at 6:27 PM Ashutosh Sharma  wrote:
>
> Thanks Dilip for the patch. AFAIU, the fix looks good. One small comment:
>

Thanks Ashutosh and Dilip for working on this. I'll look into it in a
day or two.

-- 
With Regards,
Amit Kapila.




Re: Yet another fast GiST build

2020-09-21 Thread Andrey M. Borodin



> On 21 Sept 2020, at 17:15, Heikki Linnakangas wrote:
> 
> On 21/09/2020 12:06, Andrey M. Borodin wrote
>> 
>> I think we don't set rightlinks during index build.
> 
> The new GiST sorting code does not, but the regular insert-based code does.
> 
> That's a bit questionable in the new code actually. Was that a conscious
> decision? The right-links are only needed when there are concurrent page
> splits, so I think it's OK, but the checks for InvalidBlockNumber in
> gistScanPage() and gistFindPage() have comment "/* sanity check */".
> Comment changes are needed, at least.

It was a conscious decision with incorrect motivation. I was thinking that it
would help to reduce the number of "false positive" right-page inspections. But now
I see that:
1. There should be no such "false positives" that we can avoid
2. Valid rightlinks could help to do amcheck verification in the future

But the thing that bothers me now: when we vacuum a leaf page, we bump its NSN. But
we do not bump the internal page LSN. Does this mean we will follow rightlinks
after vacuum? It seems superfluous. And btw we did not adjust internal page
tuples after vacuum...

Best regards, Andrey Borodin.



Re: Inconsistent Japanese name order in v13 contributors list

2020-09-21 Thread Alvaro Herrera
On 2020-Sep-18, Bruce Momjian wrote:

> This thread from 2015 is the most comprehensive discussion I remember of
> Japanese name ordering, including a suggestion to use small caps:
> 
>   
> https://www.postgresql.org/message-id/flat/20150613231826.GY133018%40postgresql.org#88d245a5cdd2b32e1e3e80fc07eab6f2
> 
> I have been following this guidance ever since.

Right.

About smallcaps, we didn't do it then because there was no way known to
us to use them in our then-current toolchain.  But we've changed now to
XML and apparently it is possible to use them -- we could try something
like <emphasis role="caps_lastname">Lastname</emphasis> and define a CSS rule like

.caps_lastname {font-variant: small-caps;}

(Apparently you also need to set 'emphasis.propagates.style' to 1 in the
stylesheet).  This does it for HTML.  You also need some FO trick to
cover the PDF (probably something like what a042750646db did to change
catalog_table_entry formatting.)

-- 
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: PATCH: Batch/pipelining support for libpq

2020-09-21 Thread Dave Cramer
On Mon, 21 Sep 2020 at 09:21, Matthieu Garrigues <
matthieu.garrig...@gmail.com> wrote:

> Matthieu Garrigues
>
> On Mon, Sep 21, 2020 at 3:09 PM Dave Cramer 
> wrote:
> >>
> > There was a comment upthread a while back that people should look at the
> comments made in
> https://www.postgresql.org/message-id/20180322.211148.187821341.horiguchi.kyotaro%40lab.ntt.co.jp
> by Horiguchi-San.
> >
> > From what I can tell this has not been addressed. The one big thing is
> the use of PQbatchProcessQueue vs just putting it in PQgetResult.
> >
> > The argument is that adding PQbatchProcessQueue is unnecessary and just
> adds another step. Looking at this, it seems like putting this inside
> PQgetResult would get my vote as it leaves the interface unchanged.
> >
>
> Ok. I'll merge PQbatchProcessQueue into PQgetResult. But just one
> thing: I'll keep PQgetResult returning null between the result of two
> batched query so the user
> can know which result comes from which query.
>

Fair enough.

There may be other things in his comments that need to be addressed. That
was the big one that stuck out for me.

Thanks for working on this!


Dave Cramer
www.postgres.rocks


Re: Command statistics system (cmdstats)

2020-09-21 Thread Alvaro Herrera
On 2020-Sep-18, Michael Paquier wrote:

> Based on the low level of activity and the fact that the patch was
> marked as waiting on author for a couple of weeks, it looks like
> little could be achieved by the end of the CF, and the attention was
> elsewhere, so it looked better (and it still does IMO) to give more
> attention to the remaining 170~ patches that are still lying on the CF
> app.

"A couple of weeks" of inactivity is not sufficient, in my view, to boot
a patch out of the commitfest process.  Whenever the patch is
resurrected, it will be a new entry which won't have the history that it
had accumulated in the long time since it was created -- which biases
it against other new patches.

I'm not really arguing about this one patch only, but more generally
about the process you're following.  The problem is that you as CFM are
imposing your personal priorities on the whole process by punting
patches ahead of time -- you are saying that nobody else should be
looking at updating this patch because it is now closed.  It seems fair
to do so at the end of the commitfest, but it does not seem fair to do
it when it's still mid-cycle.

Putting also in perspective the history that the patch had prior to your
unfairly closing it, there was a lot of effort in getting it reviewed to
a good state and updated by the author --  yet a single unsubstantiated
comment, without itself a lot of effort on the reviewer's part, that
"maybe it has a perf drawback" is sufficient to get it booted?  It
doesn't seem appropriate.

-- 
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: PATCH: Batch/pipelining support for libpq

2020-09-21 Thread Matthieu Garrigues
On Mon, Sep 21, 2020 at 3:39 PM Dave Cramer  wrote:
>
>
>
> On Mon, 21 Sep 2020 at 09:21, Matthieu Garrigues 
>  wrote:
>>
>> Matthieu Garrigues
>>
>> On Mon, Sep 21, 2020 at 3:09 PM Dave Cramer  
>> wrote:
>> >>
>> > There was a comment upthread a while back that people should look at the 
>> > comments made in 
>> > https://www.postgresql.org/message-id/20180322.211148.187821341.horiguchi.kyotaro%40lab.ntt.co.jp
>> >  by Horiguchi-San.
>> >
>> > From what I can tell this has not been addressed. The one big thing is the 
>> > use of PQbatchProcessQueue vs just putting it in PQgetResult.
>> >
>> > The argument is that adding PQbatchProcessQueue is unnecessary and just 
>> > adds another step. Looking at this, it seems like putting this inside 
>> > PQgetResult would get my vote as it leaves the interface unchanged.
>> >
>>
>> Ok. I'll merge PQbatchProcessQueue into PQgetResult. But just one
>> thing: I'll keep PQgetResult returning null between the result of two
>> batched query so the user
>> can know which result comes from which query.
>
>
> Fair enough.
>
> There may be other things in his comments that need to be addressed. That was 
> the big one that stuck out for me.
>
> Thanks for working on this!
>

Yes I already addressed the other things in the v19 patch:
https://www.postgresql.org/message-id/flat/cajkzx4t5e-2cqe3dtv2r78dyfvz+in8py7a8marvlhs_pg7...@mail.gmail.com




Re: Yet another fast GiST build

2020-09-21 Thread Andrey M. Borodin


> 21 сент. 2020 г., в 18:29, Andrey M. Borodin  
> написал(а):
> 
> It was a conscious decision with incorrect motivation. I was thinking that it 
> will help to reduce number of "false positive" inspecting right pages. But 
> now I see that:
> 1. There should be no such "false positives" that we can avoid
> 2. Valid rightlinks could help to do amcheck verification in future

Well, point number 2 here is invalid. There exists one leaf page p such that if
we start traversing rightlinks from p we will reach all leaf pages. But we have
practically no means to find this page. This makes rightlinks not very helpful
for amcheck on GiST.

But for consistency I think it is worth installing them.

Thanks!

Best regards, Andrey Borodin.



0001-Install-rightlinks-on-GiST-pages-in-case-of-sorting-.patch
Description: Binary data


RE: Binaries on s390x arch

2020-09-21 Thread Namrata Bhave

Hi Christoph,

Thank you for your response.

We will be glad to obtain binaries for s390x on RHEL, SLES and Ubuntu
distros.

We are ready to provide the necessary infra. To enable this, we would need
more information about the VM configuration (number of VMs, OS, vCPUs, memory,
storage). Secondly, T&Cs would need to be signed electronically.
Please let me know if you are OK to proceed, and we can communicate further.

Thanks and Regards,
Namrata




From:   Christoph Berg 
To: Namrata Bhave 
Cc: pgsql-hackers@lists.postgresql.org, Prasanna Kelkar

Date:   09/21/2020 02:31 PM
Subject:[EXTERNAL] Re: Binaries on s390x arch



Re: Namrata Bhave
> As seen from downloads page, the Apt repo/rpms are not yet available for
> s390x for latest versions.

Hi,

are you asking about apt (.deb) or yum (.rpm) packages?

> Wanted to know if there is any work going on/planned to provide Postgres
in
> ready-to-use package or installer form for s390x architecture?

s390x is fully supported as a Debian release architecture, so you can
get PostgreSQL and all packaged modules there (PG11 on Debian buster).

The apt team doesn't have plans for s390x yet on the PostgreSQL side.
If you are interested, are you in a position to donate a build host?

Christoph





Re: Yet another fast GiST build

2020-09-21 Thread Tom Lane
Heikki Linnakangas  writes:
> On 21/09/2020 02:06, Tom Lane wrote:
>> Another interesting point is that all the other index AMs seem to WAL-log
>> the new page before the smgrextend call, whereas this code is doing it
>> in the other order.  I strongly doubt that both patterns are equally
>> correct.  Could be that the other AMs are in the wrong though.

> My thinking was that it's better to call smgrextend() first, so that if 
> you run out of disk space, you get the error before WAL-logging it. That 
> reduces the chance that WAL replay will run out of disk space. A lot of 
> things are different during WAL replay, so it's quite likely that WAL 
> replay runs out of disk space anyway if you're living on the edge, but 
> still.

Yeah.  access/transam/README points out that such failures need to be
planned for, and explains what we do for heap pages:

1. Adding a disk page to an existing table.

This action isn't WAL-logged at all.  We extend a table by writing a page
of zeroes at its end.  We must actually do this write so that we are sure
the filesystem has allocated the space.  If the write fails we can just
error out normally.  Once the space is known allocated, we can initialize
and fill the page via one or more normal WAL-logged actions.  Because it's
possible that we crash between extending the file and writing out the WAL
entries, we have to treat discovery of an all-zeroes page in a table or
index as being a non-error condition.  In such cases we can just reclaim
the space for re-use.

So GIST seems to be acting according to that design.  (Someday we need
to update this para to acknowledge that not all filesystems behave as
it's assuming.)

> I didn't notice that the other callers are doing it the other way round, 
> though. I think they need to, so that they can stamp the page with the 
> LSN of the WAL record. But GiST build is special in that regard, because 
> it stamps all pages with GistBuildLSN.

Kind of unpleasant; that means they risk what the README points out:

In all of these cases, if WAL replay fails to redo the original action
we must panic and abort recovery.  The DBA will have to manually clean up
(for instance, free up some disk space or fix directory permissions) and
then restart recovery.  This is part of the reason for not writing a WAL
entry until we've successfully done the original action.

I'm not sufficiently motivated to go and change it right now, though.

regards, tom lane




Re: Global snapshots

2020-09-21 Thread Alexey Kondratov

On 2020-09-18 00:54, Bruce Momjian wrote:

On Tue, Sep  8, 2020 at 01:36:16PM +0300, Alexey Kondratov wrote:

Thank you for the link!

After a quick look on the Sawada-san's patch set I think that there are two
major differences:

1. There is a built-in foreign xacts resolver in [1], which should be much
more convenient from the end-user perspective. It involves huge in-core
changes and additional complexity that is of course worth of.

However, it's still not clear for me that it is possible to resolve all
foreign prepared xacts on the Postgres' own side with a 100% guarantee.
Imagine a situation when the coordinator node is actually a HA cluster group
(primary + sync + async replica) and it failed just after the PREPARE stage
or after the local COMMIT. In that case all foreign xacts will be left in
the prepared state. After the failover process completes, the synchronous
replica will become the new primary. Would it have all the required info to
properly resolve orphan prepared xacts?

Probably, this situation is handled properly in [1], but I've not yet
finished a thorough reading of the patch set, though it has a great doc!

On the other hand, previous 0003 and my proposed patch rely on either manual
resolution of hung prepared xacts or usage of an external monitor/resolver.
This approach is much simpler from the in-core perspective, but doesn't look
as complete as [1] though.


Have we considered how someone would clean up foreign transactions if the
coordinating server dies?  Could it be done manually?  Would an external
resolver, rather than an internal one, make this easier?


Both Sawada-san's patch [1] and the ones in this thread (e.g. mine [2]) use
2PC with a special gid format that includes an xid plus server identification
info. Thus, one can select from pg_prepared_xacts, get the xid and coordinator
info, then use txid_status() on the coordinator (or ex-coordinator) to get the
transaction status, and finally either commit or abort these stale prepared
xacts. Of course this could be wrapped into some user-level support routines,
as is done in [1].
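
To make that procedure concrete, here is a minimal sketch of such a stateless
resolver written against plain libpq.  The gid layout ("<coordinator xid>:<node
id>"), the connection handling and the lack of error checking are assumptions
for illustration only; the real gid format is defined by the patches in [1]
and [2], and a production resolver would also have to skip xacts that are
still in progress on the coordinator.

#include <stdio.h>
#include <string.h>
#include <libpq-fe.h>

/*
 * Sketch only: resolve stale prepared xacts on one participant node by
 * asking the (ex-)coordinator for the outcome via txid_status().
 */
static void
resolve_stale_xacts(PGconn *participant, PGconn *coordinator)
{
    PGresult   *xacts = PQexec(participant,
                               "SELECT gid FROM pg_prepared_xacts");

    for (int i = 0; i < PQntuples(xacts); i++)
    {
        const char *gid = PQgetvalue(xacts, i, 0);
        char        xid[32];
        char        sql[256];
        PGresult   *res;
        int         committed;

        /* hypothetical gid format: coordinator xid before the first ':' */
        sscanf(gid, "%31[^:]", xid);

        snprintf(sql, sizeof(sql), "SELECT txid_status(%s)", xid);
        res = PQexec(coordinator, sql);
        committed = !PQgetisnull(res, 0, 0) &&
            strcmp(PQgetvalue(res, 0, 0), "committed") == 0;
        PQclear(res);

        /* finish the local prepared xact according to the coordinator's outcome */
        snprintf(sql, sizeof(sql),
                 committed ? "COMMIT PREPARED '%s'" : "ROLLBACK PREPARED '%s'",
                 gid);
        PQclear(PQexec(participant, sql));
    }
    PQclear(xacts);
}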


As for the benefits of using an external resolver, I think that there 
are some of them from the whole system perspective:


1) If one follows the logic above, then this resolver could be stateless;
it takes all the required info from the Postgres nodes themselves.

2) Then you can easily put it into a container, which makes it easier to
deploy to all this 'cloud' stuff like Kubernetes.

3) Also, you can scale resolvers independently from the Postgres nodes.

I do not think that either of these points is a game changer, but we use
a very simple external resolver together with [2] in our sharding
prototype and it works just fine so far.



[1] 
https://www.postgresql.org/message-id/CA%2Bfd4k4HOVqqC5QR4H984qvD0Ca9g%3D1oLYdrJT_18zP9t%2BUsJg%40mail.gmail.com


[2] 
https://www.postgresql.org/message-id/3ef7877bfed0582019eab3d462a43275%40postgrespro.ru


--
Alexey Kondratov

Postgres Professional https://www.postgrespro.com
Russian Postgres Company




Re: recovering from "found xmin ... from before relfrozenxid ..."

2020-09-21 Thread Tom Lane
Amit Kapila  writes:
> On Sun, Sep 20, 2020 at 10:43 PM Tom Lane  wrote:
>> AFAICS, there is no chance of the existing pg_surgery regression test
>> being fully stable if we don't fix both things.

> What if ensure that it runs with autovacuum = off and there is no
> parallel test running? I am not sure about the second part but if we
> can do that then the test will be probably stable.

Then it'll not be usable under "make installcheck", which is not
very nice.  It's also arguable that you aren't testing pg_surgery
under real-world conditions if you do it like that.

Moreover, I think that both of these points need to be addressed
anyway, as they represent bugs that are reachable independently
of pg_surgery.  Admittedly, we do not have a test case that
proves that the inconsistency between pruneheap and vacuum has
any bad effects in the absence of a7212be8b.  But do you really
want to bet that there are none?

regards, tom lane




PGXS testing upgrade paths

2020-09-21 Thread James Coleman
If there's a better list than this, please let me know, but I figured
hackers is appropriate since the extension building infrastructure is
documented in core.

While working on an in-house extension I realized that while PGXS
provides the standard regression test infrastructure, I'm not aware of
an automatic or standard way to test all upgrade paths provided by the
extension.

Is there something I'm missing? Or an approach people like (I suppose
the simplest way would be "manually" executing the upgrade files in a
regress test, but that seems tedious).

Thanks,
James




Re: Lift line-length limit for pg_service.conf

2020-09-21 Thread Tom Lane
Daniel Gustafsson  writes:
> The pg_service.conf parsing thread [0] made me realize that we have a 
> hardwired
> line length of max 256 bytes.  Lifting this would be in line with recent work
> for ecpg, pg_regress and pg_hba (784b1ba1a2 and 8f8154a50).  The attached 
> moves
> pg_service.conf to use the new pg_get_line_append API and a StringInfo to lift
> the restriction. Any reason not to do that while we're lifting other such 
> limits?

+1.  I'd been thinking of looking around at our fgets calls to see
which ones need this sort of work, but didn't get to it yet.

Personally, I'd avoid depending on StringInfo.cursor here, as the
dependency isn't really buying you anything.  Instead you could do

line = linebuf.data;

just after the trim-trailing-whitespace loop and then leave the
"ignore leading whitespace too" code as it stands.

Also, the need for inserting the pfree into multiple exit paths kind
of makes me itch.  I wonder if we should change the ending code to
look like

exit:
fclose(f);
pfree(linebuf.data);

return result;

and then the early exit spots would be replaced with "result = x;
goto exit".  (Some of them could use "break", but not all, so
it's probably better to consistently use "goto".)

regards, tom lane




Re: PGXS testing upgrade paths

2020-09-21 Thread Tom Lane
James Coleman  writes:
> If there's a better list than this, please let me know, but I figured
> hackers is appropriate since the extension building infrastructure is
> documented in core.

> While working on an in-house extension I realized that while PGXS
> provides the standard regression test infrastructure, I'm not aware of
> an automatic or standard way to test all upgrade paths provided by the
> extension.

The recommended way to deal with updates these days is to leave the
original extension script as-is and just write update scripts
(1.0--1.1, 1.1--1.2, etc).  That way, application of the updates
is tested automatically every time you do CREATE EXTENSION.

Now, if you also want to check that the intermediate states still
behave as intended, I don't see much of a solution that doesn't
involve custom test scaffolding.

regards, tom lane




Re: PGXS testing upgrade paths

2020-09-21 Thread James Coleman
On Mon, Sep 21, 2020 at 11:36 AM Tom Lane  wrote:
>
> James Coleman  writes:
> > If there's a better list than this, please let me know, but I figured
> > hackers is appropriate since the extension building infrastructure is
> > documented in core.
>
> > While working on an in-house extension I realized that while PGXS
> > provides the standard regression test infrastructure, I'm not aware of
> > an automatic or standard way to test all upgrade paths provided by the
> > extension.
>
> The recommended way to deal with updates these days is to leave the
> original extension script as-is and just write update scripts
> (1.0--1.1, 1.1--1.2, etc).  That way, application of the updates
> is tested automatically every time you do CREATE EXTENSION.

Ah, so just don't add a new 1.2 file, etc.

That also implies not having more direct upgrade paths (e.g., 1.0--1.2)?

> Now, if you also want to check that the intermediate states still
> behave as intended, I don't see much of a solution that doesn't
> involve custom test scaffolding.

Yeah, I'm not so much concerned about intermediate states so much as
multiple upgrade paths and/or multiple single-version install files
(which you replied to already above).

James




Re: PGXS testing upgrade paths

2020-09-21 Thread Tom Lane
James Coleman  writes:
> On Mon, Sep 21, 2020 at 11:36 AM Tom Lane  wrote:
>> The recommended way to deal with updates these days is to leave the
>> original extension script as-is and just write update scripts
>> (1.0--1.1, 1.1--1.2, etc).  That way, application of the updates
>> is tested automatically every time you do CREATE EXTENSION.

> Ah, so just don't add a new 1.2 file, etc.

> That also implies not having more direct upgrade paths (e.g., 1.0--1.2)?

Right.  Once the accumulation of cruft starts to impact install time
substantially, maybe you want to roll things up to a new base version.
But at least with the contrib modules we've found that it's seldom
worth the trouble.

regards, tom lane




Re: Planner, check if can use consider HASH for groupings (src/backend/optimizer/plan/planner.c)

2020-09-21 Thread Tomas Vondra

On Mon, Sep 21, 2020 at 08:32:35AM -0300, Ranier Vilela wrote:

Em dom., 20 de set. de 2020 às 21:10, Tomas Vondra <
tomas.von...@2ndquadrant.com> escreveu:


On Sun, Sep 20, 2020 at 08:09:56PM -0300, Ranier Vilela wrote:
>Em sex., 18 de set. de 2020 às 10:37, Tomas Vondra <
>tomas.von...@2ndquadrant.com> escreveu:
>
>> On Thu, Sep 17, 2020 at 06:31:12PM -0300, Ranier Vilela wrote:
>> >Em qui., 17 de set. de 2020 às 17:09, Tomas Vondra <
>> >tomas.von...@2ndquadrant.com> escreveu:
>> >
>> >> On Thu, Sep 17, 2020 at 02:12:12PM -0300, Ranier Vilela wrote:
>> >> >Hi,
>> >> >
>> >> >In case gd->any_hashable is FALSE, grouping_is_hashable is never
>> called.
>> >> >In this case, the planner could use HASH for groupings, but will
never
>> >> know.
>> >> >
>> >>
>> >> The condition is pretty simple - if the query has grouping sets,
look at
>> >> grouping sets, otherwise look at groupClause. For this to be an issue
>> >> the query would need to have both grouping sets and (independent)
group
>> >> clause - which AFAIK is not possible.
>> >>
>> >Uh?
>> >(parse->groupClause != NIL) If different from NIL we have
((independent)
>> >group clause), grouping_is_hashable should check?
>> >(gd ? gd->any_hashable : grouping_is_hashable(parse->groupClause
>> >If gd is not NIL and gd->any_hashtable is FALSE?
>> >
>>
>> Sorry, I'm not quite sure I understand what this is meant to say :-(
>>
>> Anyway, (groupClause != NIL) does not mean the groupClause is somehow
>> independent (of what?). Add some debugging to create_grouping_paths and
>> you'll see that e.g. this query ends up with groupClause with 3 items:
>>
>>  select 1 from t group by grouping sets ((a), (b), (c));
>>
>> and so does this one:
>>
>>  select 1 from t group by grouping sets ((a,c), (a,b));
>>
>> I'm no expert in grouping sets, but I'd bet this means we transform the
>> grouping sets into a groupClause by extracting the keys. I haven't
>> investigated why exactly we do this, but I'm sure there's a reason (e.g.
>> it gives us SortGroupClause).
>>
>> You seem to believe a query can have both grouping sets and regular
>> grouping at the same level - but how would such query look like? Surely
>> you can't have two GROuP BY clauses. You can do
>>
>Translating into code:
>gd is grouping sets and
>parse->groupClause is regular grouping
>and we cannot have both at the same time.
>

Have you verified the claim that we can't have both at the same time? As
I already explained, we build groupClause from grouping sets. I suggest
you do some debugging using the queries I used as examples, and you'll
see the claim is not true.


I think we already agreed that we cannot have both at the same time.



No, we haven't. I think we've agreed that you can't have both a group by
clause with grouping sets and then another group by clause with "plain"
grouping. But even with grouping sets you'll have groupClause generated
from the grouping sets. See preprocess_grouping_sets.
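
For readers skimming the thread, the test being debated can be paraphrased
roughly like this (a paraphrase of the fragments quoted above, reduced to its
inputs -- not the exact create_grouping_paths() code):

#include <stdbool.h>

/*
 * With grouping sets it is enough that at least one grouping set is
 * hashable; without them the whole groupClause must be hashable.
 */
static bool
consider_hashed_grouping(bool has_group_clause,      /* parse->groupClause != NIL */
                         bool has_grouping_sets,     /* gd != NULL */
                         bool any_set_hashable,      /* gd->any_hashable */
                         bool group_clause_hashable) /* grouping_is_hashable(...) */
{
    if (!has_group_clause)
        return false;
    return has_grouping_sets ? any_set_hashable : group_clause_hashable;
}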





>
>>  select 1 from t group by a, grouping sets ((b), (c));
>>
>> which is however mostly equivalent to (AFAICS)
>>
>>  select 1 from t group by grouping sets ((a,b), (a,c))
>>
>> so it's not like we have an independent groupClause either I think.
>>
>> The whole point of the original condition is - if there are grouping
>> sets, check if at least one can be executed using hashing (i.e. all keys
>> are hashable). Otherwise (without grouping sets) look at the grouping as
>> a whole.
>>
>Or if gd is NULL check  parse->groupClause.
>

Which is what I'm saying.

>
>> I don't see how your change improves this - if there are grouping sets,
>> it's futile to look at the whole groupClause if at least one grouping
>> set can't be hashed.
>>
>> But there's a simple way to disprove this - show us a case (table and a
>> query) where your code correctly allows hashing while the current one
>> does not.
>
>I'm not an expert in grouping either.
>The question I have here is whether gd is populated and has gd->
>any_hashable as FALSE,
>Its mean no use checking parse-> groupClause, it's a waste of time, Ok.
>

I'm sorry, I don't follow your logic. Those are two separate cases. If
we have grouping sets, we have to check if at least one can be hashed.
If we don't have grouping sets, we have to check groupClause directly.
Why would that be a waste of time is unclear to me.


This is clear to me.
The problem is how it was implemented in create_grouping_paths.



You haven't demonstrated what the problem is, though. Showing us a query
that fails would make it very clear.





>
>>
>> >
>> >> For hashing to be worth considering, at least one grouping set has
to be
>> >> hashable - otherwise it's pointless.
>> >>
>> >> Granted, you could have something like
>> >>
>> >>  GROUP BY GROUPING SETS ((a), (b)), c
>> >>
>> >> which I think essentially says "add c to every grouping set" and that
>> >> will be covered by the any_hashable check.
>> >>
>> >I am not going into the merit of whether or not it is worth hash

Re: doc review for v13

2020-09-21 Thread Tom Lane
Justin Pryzby  writes:
> Thanks.  Here's the remainder, with some new ones.

LGTM.  I tweaked one or two places a bit more, and pushed it.

regards, tom lane




Re: Planner, check if can use consider HASH for groupings (src/backend/optimizer/plan/planner.c)

2020-09-21 Thread Ranier Vilela
Em seg., 21 de set. de 2020 às 13:37, Tomas Vondra <
tomas.von...@2ndquadrant.com> escreveu:

> On Mon, Sep 21, 2020 at 08:32:35AM -0300, Ranier Vilela wrote:
> >Em dom., 20 de set. de 2020 às 21:10, Tomas Vondra <
> >tomas.von...@2ndquadrant.com> escreveu:
> >
> >> On Sun, Sep 20, 2020 at 08:09:56PM -0300, Ranier Vilela wrote:
> >> >Em sex., 18 de set. de 2020 às 10:37, Tomas Vondra <
> >> >tomas.von...@2ndquadrant.com> escreveu:
> >> >
> >> >> On Thu, Sep 17, 2020 at 06:31:12PM -0300, Ranier Vilela wrote:
> >> >> >Em qui., 17 de set. de 2020 às 17:09, Tomas Vondra <
> >> >> >tomas.von...@2ndquadrant.com> escreveu:
> >> >> >
> >> >> >> On Thu, Sep 17, 2020 at 02:12:12PM -0300, Ranier Vilela wrote:
> >> >> >> >Hi,
> >> >> >> >
> >> >> >> >In case gd->any_hashable is FALSE, grouping_is_hashable is never
> >> >> called.
> >> >> >> >In this case, the planner could use HASH for groupings, but will
> >> never
> >> >> >> know.
> >> >> >> >
> >> >> >>
> >> >> >> The condition is pretty simple - if the query has grouping sets,
> >> look at
> >> >> >> grouping sets, otherwise look at groupClause. For this to be an
> issue
> >> >> >> the query would need to have both grouping sets and (independent)
> >> group
> >> >> >> clause - which AFAIK is not possible.
> >> >> >>
> >> >> >Uh?
> >> >> >(parse->groupClause != NIL) If different from NIL we have
> >> ((independent)
> >> >> >group clause), grouping_is_hashable should check?
> >> >> >(gd ? gd->any_hashable : grouping_is_hashable(parse->groupClause
> >> >> >If gd is not NIL and gd->any_hashtable is FALSE?
> >> >> >
> >> >>
> >> >> Sorry, I'm not quite sure I understand what this is meant to say :-(
> >> >>
> >> >> Anyway, (groupClause != NIL) does not mean the groupClause is somehow
> >> >> independent (of what?). Add some debugging to create_grouping_paths
> and
> >> >> you'll see that e.g. this query ends up with groupClause with 3
> items:
> >> >>
> >> >>  select 1 from t group by grouping sets ((a), (b), (c));
> >> >>
> >> >> and so does this one:
> >> >>
> >> >>  select 1 from t group by grouping sets ((a,c), (a,b));
> >> >>
> >> >> I'm no expert in grouping sets, but I'd bet this means we transform
> the
> >> >> grouping sets into a groupClause by extracting the keys. I haven't
> >> >> investigated why exactly we do this, but I'm sure there's a reason
> (e.g.
> >> >> it gives us SortGroupClause).
> >> >>
> >> >> You seem to believe a query can have both grouping sets and regular
> >> >> grouping at the same level - but how would such query look like?
> Surely
> >> >> you can't have two GROuP BY clauses. You can do
> >> >>
> >> >Translating into code:
> >> >gd is grouping sets and
> >> >parse->groupClause is regular grouping
> >> >and we cannot have both at the same time.
> >> >
> >>
> >> Have you verified the claim that we can't have both at the same time? As
> >> I already explained, we build groupClause from grouping sets. I suggest
> >> you do some debugging using the queries I used as examples, and you'll
> >> see the claim is not true.
> >>
> >I think we already agreed that we cannot have both at the same time.
> >
>
> No, we haven't. I think we've agreed that you can't have both a group by
> clause with grouping sets and then another group by clause with "plain"
> grouping. But even with grouping sets you'll have groupClause generated
> from the grouping sets. See preprocess_grouping_sets.
>
> >
> >>
> >> >
> >> >>  select 1 from t group by a, grouping sets ((b), (c));
> >> >>
> >> >> which is however mostly equivalent to (AFAICS)
> >> >>
> >> >>  select 1 from t group by grouping sets ((a,b), (a,c))
> >> >>
> >> >> so it's not like we have an independent groupClause either I think.
> >> >>
> >> >> The whole point of the original condition is - if there are grouping
> >> >> sets, check if at least one can be executed using hashing (i.e. all
> keys
> >> >> are hashable). Otherwise (without grouping sets) look at the
> grouping as
> >> >> a whole.
> >> >>
> >> >Or if gd is NULL check  parse->groupClause.
> >> >
> >>
> >> Which is what I'm saying.
> >>
> >> >
> >> >> I don't see how your change improves this - if there are grouping
> sets,
> >> >> it's futile to look at the whole groupClause if at least one grouping
> >> >> set can't be hashed.
> >> >>
> >> >> But there's a simple way to disprove this - show us a case (table
> and a
> >> >> query) where your code correctly allows hashing while the current one
> >> >> does not.
> >> >
> >> >I'm not an expert in grouping either.
> >> >The question I have here is whether gd is populated and has gd->
> >> >any_hashable as FALSE,
> >> >Its mean no use checking parse-> groupClause, it's a waste of time, Ok.
> >> >
> >>
> >> I'm sorry, I don't follow your logic. Those are two separate cases. If
> >> we have grouping sets, we have to check if at least one can be hashed.
> >> If we don't have grouping sets, we have to check groupClause directly.
> >> Why would that be a waste of time is unclear to 

Re: Planner, check if can use consider HASH for groupings (src/backend/optimizer/plan/planner.c)

2020-09-21 Thread Tomas Vondra

On Sun, Sep 20, 2020 at 01:50:34PM -0300, Ranier Vilela wrote:

Em seg., 21 de set. de 2020 às 13:37, Tomas Vondra <
tomas.von...@2ndquadrant.com> escreveu:


On Mon, Sep 21, 2020 at 08:32:35AM -0300, Ranier Vilela wrote:
>Em dom., 20 de set. de 2020 às 21:10, Tomas Vondra <
>tomas.von...@2ndquadrant.com> escreveu:
>
>> On Sun, Sep 20, 2020 at 08:09:56PM -0300, Ranier Vilela wrote:
>> >Em sex., 18 de set. de 2020 às 10:37, Tomas Vondra <
>> >tomas.von...@2ndquadrant.com> escreveu:
>> >
>> >> On Thu, Sep 17, 2020 at 06:31:12PM -0300, Ranier Vilela wrote:
>> >> >Em qui., 17 de set. de 2020 às 17:09, Tomas Vondra <
>> >> >tomas.von...@2ndquadrant.com> escreveu:
>> >> >
>> >> >> On Thu, Sep 17, 2020 at 02:12:12PM -0300, Ranier Vilela wrote:
>> >> >> >Hi,
>> >> >> >
>> >> >> >In case gd->any_hashable is FALSE, grouping_is_hashable is never
>> >> called.
>> >> >> >In this case, the planner could use HASH for groupings, but will
>> never
>> >> >> know.
>> >> >> >
>> >> >>
>> >> >> The condition is pretty simple - if the query has grouping sets,
>> look at
>> >> >> grouping sets, otherwise look at groupClause. For this to be an
issue
>> >> >> the query would need to have both grouping sets and (independent)
>> group
>> >> >> clause - which AFAIK is not possible.
>> >> >>
>> >> >Uh?
>> >> >(parse->groupClause != NIL) If different from NIL we have
>> ((independent)
>> >> >group clause), grouping_is_hashable should check?
>> >> >(gd ? gd->any_hashable : grouping_is_hashable(parse->groupClause
>> >> >If gd is not NIL and gd->any_hashtable is FALSE?
>> >> >
>> >>
>> >> Sorry, I'm not quite sure I understand what this is meant to say :-(
>> >>
>> >> Anyway, (groupClause != NIL) does not mean the groupClause is somehow
>> >> independent (of what?). Add some debugging to create_grouping_paths
and
>> >> you'll see that e.g. this query ends up with groupClause with 3
items:
>> >>
>> >>  select 1 from t group by grouping sets ((a), (b), (c));
>> >>
>> >> and so does this one:
>> >>
>> >>  select 1 from t group by grouping sets ((a,c), (a,b));
>> >>
>> >> I'm no expert in grouping sets, but I'd bet this means we transform
the
>> >> grouping sets into a groupClause by extracting the keys. I haven't
>> >> investigated why exactly we do this, but I'm sure there's a reason
(e.g.
>> >> it gives us SortGroupClause).
>> >>
>> >> You seem to believe a query can have both grouping sets and regular
>> >> grouping at the same level - but how would such query look like?
Surely
>> >> you can't have two GROuP BY clauses. You can do
>> >>
>> >Translating into code:
>> >gd is grouping sets and
>> >parse->groupClause is regular grouping
>> >and we cannot have both at the same time.
>> >
>>
>> Have you verified the claim that we can't have both at the same time? As
>> I already explained, we build groupClause from grouping sets. I suggest
>> you do some debugging using the queries I used as examples, and you'll
>> see the claim is not true.
>>
>I think we already agreed that we cannot have both at the same time.
>

No, we haven't. I think we've agreed that you can't have both a group by
clause with grouping sets and then another group by clause with "plain"
grouping. But even with grouping sets you'll have groupClause generated
from the grouping sets. See preprocess_grouping_sets.

>
>>
>> >
>> >>  select 1 from t group by a, grouping sets ((b), (c));
>> >>
>> >> which is however mostly equivalent to (AFAICS)
>> >>
>> >>  select 1 from t group by grouping sets ((a,b), (a,c))
>> >>
>> >> so it's not like we have an independent groupClause either I think.
>> >>
>> >> The whole point of the original condition is - if there are grouping
>> >> sets, check if at least one can be executed using hashing (i.e. all
keys
>> >> are hashable). Otherwise (without grouping sets) look at the
grouping as
>> >> a whole.
>> >>
>> >Or if gd is NULL check  parse->groupClause.
>> >
>>
>> Which is what I'm saying.
>>
>> >
>> >> I don't see how your change improves this - if there are grouping
sets,
>> >> it's futile to look at the whole groupClause if at least one grouping
>> >> set can't be hashed.
>> >>
>> >> But there's a simple way to disprove this - show us a case (table
and a
>> >> query) where your code correctly allows hashing while the current one
>> >> does not.
>> >
>> >I'm not an expert in grouping either.
>> >The question I have here is whether gd is populated and has gd->
>> >any_hashable as FALSE,
>> >Its mean no use checking parse-> groupClause, it's a waste of time, Ok.
>> >
>>
>> I'm sorry, I don't follow your logic. Those are two separate cases. If
>> we have grouping sets, we have to check if at least one can be hashed.
>> If we don't have grouping sets, we have to check groupClause directly.
>> Why would that be a waste of time is unclear to me.
>>
>This is clear to me.
>The problem is how it was implemented in create_grouping_paths.
>

You haven't demonstrated what the problem is, though. Showing us a query
that fails would

Re: PATCH: Batch/pipelining support for libpq

2020-09-21 Thread Matthieu Garrigues
Hi Dave,
I merged PQbatchProcessQueue into PQgetResult.
An initial call to PQbatchProcessQueue was also required in PQsendQueue so
that PQgetResult is ready to read the first batched query.

Tests and documentation are updated accordingly.
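
For illustration, a minimal client loop against the patched API might look
like the sketch below.  The function names come from the patch's
documentation; the exact signatures and return values are assumptions, and
error handling is omitted.

#include <stdio.h>
#include <libpq-fe.h>

/*
 * Sketch of batch-mode usage: queue two queries, mark the end of the
 * batch, then read the results.  PQgetResult returns the result of the
 * first query, then NULL, then the result of the second query, then NULL.
 */
static void
run_small_batch(PGconn *conn)
{
    PGresult   *res;

    PQenterBatchMode(conn);     /* only asynchronous functions from here on */

    PQsendQueryParams(conn, "SELECT 1", 0, NULL, NULL, NULL, NULL, 0);
    PQsendQueryParams(conn, "SELECT 2", 0, NULL, NULL, NULL, NULL, 0);

    PQbatchSendQueue(conn);     /* mark the end of the batch */

    /* one pass per queued query; NULL separates the two queries' results */
    for (int q = 0; q < 2; q++)
    {
        while ((res = PQgetResult(conn)) != NULL)
        {
            if (PQresultStatus(res) == PGRES_TUPLES_OK)
                printf("query %d: %s\n", q + 1, PQgetvalue(res, 0, 0));
            PQclear(res);
        }
    }
}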

Matthieu Garrigues

On Mon, Sep 21, 2020 at 3:39 PM Dave Cramer  wrote:
>
>
>
> On Mon, 21 Sep 2020 at 09:21, Matthieu Garrigues 
>  wrote:
>>
>> Matthieu Garrigues
>>
>> On Mon, Sep 21, 2020 at 3:09 PM Dave Cramer  
>> wrote:
>> >>
>> > There was a comment upthread a while back that people should look at the 
>> > comments made in 
>> > https://www.postgresql.org/message-id/20180322.211148.187821341.horiguchi.kyotaro%40lab.ntt.co.jp
>> >  by Horiguchi-San.
>> >
>> > From what I can tell this has not been addressed. The one big thing is the 
>> > use of PQbatchProcessQueue vs just putting it in PQgetResult.
>> >
>> > The argument is that adding PQbatchProcessQueue is unnecessary and just 
>> > adds another step. Looking at this, it seems like putting this inside 
>> > PQgetResult would get my vote as it leaves the interface unchanged.
>> >
>>
>> Ok. I'll merge PQbatchProcessQueue into PQgetResult. But just one
>> thing: I'll keep PQgetResult returning null between the result of two
>> batched query so the user
>> can know which result comes from which query.
>
>
> Fair enough.
>
> There may be other things in his comments that need to be addressed. That was 
> the big one that stuck out for me.
>
> Thanks for working on this!
>
>
> Dave Cramer
> www.postgres.rocks
diff --git a/doc/src/sgml/libpq.sgml b/doc/src/sgml/libpq.sgml
index 92556c7ce0..15d0c03c89 100644
--- a/doc/src/sgml/libpq.sgml
+++ b/doc/src/sgml/libpq.sgml
@@ -4829,6 +4829,465 @@ int PQflush(PGconn *conn);
 
  
 
+ 
+  Batch mode and query pipelining
+
+  
+   libpq
+   batch mode
+  
+
+  
+   libpq
+   pipelining
+  
+
+  
+   libpq supports queueing up queries into
+   a pipeline to be executed as a batch on the server. Batching queries allows
+   applications to avoid a client/server round-trip after each query to get
+   the results before issuing the next query.
+  
+
+  
+   When to use batching
+
+   
+Much like asynchronous query mode, there is no performance disadvantage to
+using batching and pipelining. It increases client application complexity
+and extra caution is required to prevent client/server deadlocks but
+can sometimes offer considerable performance improvements.
+   
+
+   
+Batching is most useful when the server is distant, i.e. network latency
+(ping time) is high, and when many small operations are being performed in
+rapid sequence. There is usually less benefit in using batches when each
+query takes many multiples of the client/server round-trip time to execute.
+A 100-statement operation run on a server 300ms round-trip-time away would take
+30 seconds in network latency alone without batching; with batching it may spend
+as little as 0.3s waiting for results from the server.
+   
+
+   
+Use batches when your application does lots of small
+INSERT, UPDATE and
+DELETE operations that can't easily be transformed into
+operations on sets or into a
+COPY operation.
+   
+
+   
+Batching is not useful when information from one operation is required by the
+client before it knows enough to send the next operation. The client must
+introduce a synchronisation point and wait for a full client/server
+round-trip to get the results it needs. However, it's often possible to
+adjust the client design to exchange the required information server-side.
+Read-modify-write cycles are especially good candidates; for example:
+
+ BEGIN;
+ SELECT x FROM mytable WHERE id = 42 FOR UPDATE;
+ -- result: x=2
+ -- client adds 1 to x:
+ UPDATE mytable SET x = 3 WHERE id = 42;
+ COMMIT;
+
+could be much more efficiently done with:
+
+ UPDATE mytable SET x = x + 1 WHERE id = 42;
+
+   
+
+   
+
+ The batch API was introduced in PostgreSQL 10.0, but clients using the PostgreSQL 10.0 version of libpq can
+ use batches on server versions 7.4 and newer. Batching works on any server
+ that supports the v3 extended query protocol.
+
+   
+
+  
+
+  
+   Using batch mode
+
+   
+To issue batches the application must switch
+a connection into batch mode. Enter batch mode with PQenterBatchMode(conn) or test
+whether batch mode is active with PQbatchStatus(conn). In batch mode only asynchronous operations are permitted, and
+COPY is not recommended as it most likely will trigger failure in batch processing. 
+Using any of the synchronous command execution functions, such as PQfn,
+PQexec or one of their sibling functions, is an error condition.
+Functions allowed in batch mode are described in . 
+   
+
+   
+The client uses libpq's asynchronous query functions to dispatch work,
+marking the end of each batch with PQbatchSendQueue.
+And

Re: Planner, check if can use consider HASH for groupings (src/backend/optimizer/plan/planner.c)

2020-09-21 Thread Ranier Vilela
Em seg., 21 de set. de 2020 às 14:24, Tomas Vondra <
tomas.von...@2ndquadrant.com> escreveu:

> On Sun, Sep 20, 2020 at 01:50:34PM -0300, Ranier Vilela wrote:
> >Em seg., 21 de set. de 2020 às 13:37, Tomas Vondra <
> >tomas.von...@2ndquadrant.com> escreveu:
> >
> >> On Mon, Sep 21, 2020 at 08:32:35AM -0300, Ranier Vilela wrote:
> >> >Em dom., 20 de set. de 2020 às 21:10, Tomas Vondra <
> >> >tomas.von...@2ndquadrant.com> escreveu:
> >> >
> >> >> On Sun, Sep 20, 2020 at 08:09:56PM -0300, Ranier Vilela wrote:
> >> >> >Em sex., 18 de set. de 2020 às 10:37, Tomas Vondra <
> >> >> >tomas.von...@2ndquadrant.com> escreveu:
> >> >> >
> >> >> >> On Thu, Sep 17, 2020 at 06:31:12PM -0300, Ranier Vilela wrote:
> >> >> >> >Em qui., 17 de set. de 2020 às 17:09, Tomas Vondra <
> >> >> >> >tomas.von...@2ndquadrant.com> escreveu:
> >> >> >> >
> >> >> >> >> On Thu, Sep 17, 2020 at 02:12:12PM -0300, Ranier Vilela wrote:
> >> >> >> >> >Hi,
> >> >> >> >> >
> >> >> >> >> >In case gd->any_hashable is FALSE, grouping_is_hashable is
> never
> >> >> >> called.
> >> >> >> >> >In this case, the planner could use HASH for groupings, but
> will
> >> >> never
> >> >> >> >> know.
> >> >> >> >> >
> >> >> >> >>
> >> >> >> >> The condition is pretty simple - if the query has grouping
> sets,
> >> >> look at
> >> >> >> >> grouping sets, otherwise look at groupClause. For this to be an
> >> issue
> >> >> >> >> the query would need to have both grouping sets and
> (independent)
> >> >> group
> >> >> >> >> clause - which AFAIK is not possible.
> >> >> >> >>
> >> >> >> >Uh?
> >> >> >> >(parse->groupClause != NIL) If different from NIL we have
> >> >> ((independent)
> >> >> >> >group clause), grouping_is_hashable should check?
> >> >> >> >(gd ? gd->any_hashable :
> grouping_is_hashable(parse->groupClause
> >> >> >> >If gd is not NIL and gd->any_hashtable is FALSE?
> >> >> >> >
> >> >> >>
> >> >> >> Sorry, I'm not quite sure I understand what this is meant to say
> :-(
> >> >> >>
> >> >> >> Anyway, (groupClause != NIL) does not mean the groupClause is
> somehow
> >> >> >> independent (of what?). Add some debugging to
> create_grouping_paths
> >> and
> >> >> >> you'll see that e.g. this query ends up with groupClause with 3
> >> items:
> >> >> >>
> >> >> >>  select 1 from t group by grouping sets ((a), (b), (c));
> >> >> >>
> >> >> >> and so does this one:
> >> >> >>
> >> >> >>  select 1 from t group by grouping sets ((a,c), (a,b));
> >> >> >>
> >> >> >> I'm no expert in grouping sets, but I'd bet this means we
> transform
> >> the
> >> >> >> grouping sets into a groupClause by extracting the keys. I haven't
> >> >> >> investigated why exactly we do this, but I'm sure there's a reason
> >> (e.g.
> >> >> >> it gives us SortGroupClause).
> >> >> >>
> >> >> >> You seem to believe a query can have both grouping sets and
> regular
> >> >> >> grouping at the same level - but how would such query look like?
> >> Surely
> >> >> >> you can't have two GROuP BY clauses. You can do
> >> >> >>
> >> >> >Translating into code:
> >> >> >gd is grouping sets and
> >> >> >parse->groupClause is regular grouping
> >> >> >and we cannot have both at the same time.
> >> >> >
> >> >>
> >> >> Have you verified the claim that we can't have both at the same
> time? As
> >> >> I already explained, we build groupClause from grouping sets. I
> suggest
> >> >> you do some debugging using the queries I used as examples, and
> you'll
> >> >> see the claim is not true.
> >> >>
> >> >I think we already agreed that we cannot have both at the same time.
> >> >
> >>
> >> No, we haven't. I think we've agreed that you can't have both a group by
> >> clause with grouping sets and then another group by clause with "plain"
> >> grouping. But even with grouping sets you'll have groupClause generated
> >> from the grouping sets. See preprocess_grouping_sets.
> >>
> >> >
> >> >>
> >> >> >
> >> >> >>  select 1 from t group by a, grouping sets ((b), (c));
> >> >> >>
> >> >> >> which is however mostly equivalent to (AFAICS)
> >> >> >>
> >> >> >>  select 1 from t group by grouping sets ((a,b), (a,c))
> >> >> >>
> >> >> >> so it's not like we have an independent groupClause either I
> think.
> >> >> >>
> >> >> >> The whole point of the original condition is - if there are
> grouping
> >> >> >> sets, check if at least one can be executed using hashing (i.e.
> all
> >> keys
> >> >> >> are hashable). Otherwise (without grouping sets) look at the
> >> grouping as
> >> >> >> a whole.
> >> >> >>
> >> >> >Or if gd is NULL check  parse->groupClause.
> >> >> >
> >> >>
> >> >> Which is what I'm saying.
> >> >>
> >> >> >
> >> >> >> I don't see how your change improves this - if there are grouping
> >> sets,
> >> >> >> it's futile to look at the whole groupClause if at least one
> grouping
> >> >> >> set can't be hashed.
> >> >> >>
> >> >> >> But there's a simple way to disprove this - show us a case (table
> >> and a
> >> >> >> query) where your code correctly allows hashing while the curre

Re: WIP: BRIN multi-range indexes

2020-09-21 Thread John Naylor
On Fri, Sep 18, 2020 at 6:27 PM Tomas Vondra
 wrote:

> But maybe we could still use this scheme by actually computing
>
> h1 = hash_uint32_extended(value, seed1);
> h2 = hash_uint32_extended(value, seed2);
>
> and then use this as the independent hash functions. I think that would
> meet the requirements of the paper.

Yeah, that would work algorithmically. It would be trivial to add to
the patch, too of course. There'd be a higher up-front cpu cost. Also,
I'm a bit cautious of rehashing hashes, and whether the two values
above are independent enough. I'm not sure either of these points
matters. My guess is the partition approach is more sound, but it has
some minor organizational challenges (see below).
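
To make the two-seed variant concrete: once h1 and h2 are available (however
they are obtained, e.g. two hash_uint32_extended() calls with different
seeds), the k probe positions can be derived in the usual
Kirsch/Mitzenmacher double-hashing way.  A minimal sketch, with the
independence question left open as discussed above:

#include <stdint.h>

/* g_i = (h1 + i * h2) mod m, for i = 0 .. k-1 */
static void
bloom_probe_positions(uint64_t h1, uint64_t h2,
                      int k, uint64_t m, uint64_t *pos)
{
    for (int i = 0; i < k; i++)
        pos[i] = (h1 + (uint64_t) i * h2) % m;
}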

> Why would 11k bits be more than we'd like to store? Assuming we could
> use the whole 8kB page for the bloom filter, that'd be about 64k bits.
> In practice there'd be a bit of overhead (page header ...) but it's
> still much more than 11k bits.

Brain fade -- I guess I thought we were avoiding being toasted, but
now I see that's not possible for BRIN storage. So, we'll want to
guard against this:

ERROR:  index row size 8160 exceeds maximum 8152 for index "bar_num_idx"

While playing around with the numbers I had an epiphany: At the
defaults, the filter already takes up ~4.3kB, over half the page.
There is no room for another tuple, so if we're only indexing one
column, we might as well take up the whole page. Here MT = max tuples
per 128 8k pages, or 37120, so default ndistinct is 3712.

n  k  m  p
MT/10  7  35580  0.01
MT/10  7  64000  0.0005
MT/10  12 64000  0.00025

Assuming ndistinct isn't way off reality, we get 20x-40x lower false
positive rate almost for free, and it'd be trivial to code! Keeping k
at 7 would be more robust, since it's equivalent to starting with n =
~6000, p = 0.006, which is still almost 2x less false positives than
you asked for. It also means nearly doubling the number of sorted
values before switching.
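
For reference, these n/k/m/p combinations follow the standard Bloom filter
approximations (a textbook result, nothing specific to this patch):

    p \approx \left(1 - e^{-kn/m}\right)^{k}, \qquad
    k_{opt} = \frac{m}{n}\,\ln 2, \qquad
    m \approx -\frac{n \ln p}{(\ln 2)^2}

e.g. n = 3712 and m = 35580 give k_opt of about 6.6 and p of about 0.01,
matching the first row above.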

Going the other direction, capping nbits to 64k bits when ndistinct
gets too high, the false positive rate we can actually support starts
to drop. Here, the user requested 0.001 fpr.

n      k  p
4500   9  0.001
6000   7  0.006
7500   6  0.017
15000  3  0.129  (probably useless by now)
MT     1  0.440
64000  1  0.63   (possible with > 128 pages per range)

I imagine smaller pages_per_range settings are going to be useful for
skinny tables (note to self -- test). Maybe we could provide a way for
the user to see that their combination of pages_per_range, false
positive rate, and ndistinct is supportable, like
brin_bloom_get_supported_fpr(). Or document to check with
page_inspect. And that's not even considering multi-column indexes,
like you mentioned.

> But I guess we can simply make the table
> of primes a bit larger, right?

If we want to support all the above cases without falling down
entirely, it would have to go up to 32k to be safe (When k = 1 we
could degenerate to one modulus on the whole filter). That would be a
table of about 7kB, which we could binary search. [thinks for a
moment]...Actually, if we wanted to be clever, maybe we could
precalculate the primes needed for the 64k bit cases and stick them at
the end of the array. The usual algorithm will find them. That way, we
could keep the array around 2kB. However, for >8kB block size, we
couldn't adjust the 64k number, which might be okay, but not really
project policy.

We could also generate the primes via a sieve instead, which is really
fast (and more code). That would be more robust, but that would
require the filter to store the actual primes used, so 20 more bytes
at max k = 10. We could hard-code space for that, or to keep from
hard-coding maximum k and thus lowest possible false positive rate,
we'd need more bookkeeping.

So, with the partition approach, we'd likely have to set in stone
either max nbits, or min target false positive rate. The latter option
seems more principled, not only for the block size, but also since the
target fp rate is already fixed by the reloption, and as I've
demonstrated, we can often go above and beyond the reloption even
without changing k.

> >One wrinkle is that the sum of k primes is not going to match m
> >exactly. If the sum is too small, we can trim a few bits off of the
> >filter bitmap. If the sum is too large, the last partition can spill
> >into the front of the first one. This shouldn't matter much in the
> >common case since we need to round m to the nearest byte anyway.
> >
>
> AFAIK the paper simply says that as long as the sum of partitions is
> close to the requested nbits, it's good enough. So I guess we could just
> roll with that, no need to trim/wrap or something like that.

Hmm, I'm not sure I understand you. I can see not caring to trim
wasted bits, but we can't set/read off the end of the filter. If we
don't wrap, we could just skip reading/writing that bit. So a tiny
portion of access would be at k - 1. The paper is not clear

Re: pgindent vs dtrace on macos

2020-09-21 Thread Tom Lane
I wrote:
> We still have to deal with
> src/backend/utils/sort/qsort_tuple.c
> src/pl/plpgsql/src/plerrcodes.h
> src/pl/plpython/spiexceptions.h
> src/pl/tcl/pltclerrcodes.h
> if we want to be entirely clean about this.

I took care of those, so I think we're done here.

regards, tom lane




Re: recovering from "found xmin ... from before relfrozenxid ..."

2020-09-21 Thread Robert Haas
On Sun, Sep 20, 2020 at 1:13 PM Tom Lane  wrote:
> 1. My patch a7212be8b does indeed have a problem.  It will allow
> vacuum_set_xid_limits to compute freezeLimit = nextXid for a temp
> table if freeze_min_age is zero (ie VACUUM FREEZE).  If there's
> any concurrent transactions, this falls foul of
> heap_prepare_freeze_tuple's expectation that
>
>  * NB: cutoff_xid *must* be <= the current global xmin, to ensure that any
>  * XID older than it could neither be running nor seen as running by any
>  * open transaction.  This ensures that the replacement will not change
>  * anyone's idea of the tuple state.
>
> The "cannot freeze committed xmax" error message appears to be banking on
> the assumption that we'd not reach heap_prepare_freeze_tuple for any
> committed-dead tuple unless its xmax is past the specified cutoff_xid.
>
> 2. As Amit suspected, there's an inconsistency between pruneheap.c's
> rules for which tuples are removable and vacuum.c's rules for that.
> This seems like a massive bug in its own right: what's the point of
> pruneheap.c going to huge effort to decide whether it should keep a
> tuple if vacuum will then kill it anyway?  I do not understand why
> whoever put in the GlobalVisState stuff only applied it in pruneheap.c
> and not VACUUM proper.

I am not sure I fully understand why you're contrasting pruneheap.c
with vacuum here, because vacuum just does HOT pruning to remove dead
tuples - maybe calling the relevant functions with different
arguments, but it doesn't have its own independent logic for that.

The key point is that the freezing code isn't, or at least
historically wasn't, very smart about dead tuples. For example, I
think if you told it to freeze something that was dead it would just
do it, which is obviously bad. And that's why Andres stuck those
sanity checks in there. But it's still pretty fragile. I think perhaps
the pruning code should be rewritten in such a way that it can be
combined with the code that freezes and marks pages all-visible, so
that there's not so much action at a distance, but such an endeavor is
in itself pretty scary, and certainly not back-patchable.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: Improper use about DatumGetInt32

2020-09-21 Thread Robert Haas
On Sun, Sep 20, 2020 at 9:17 PM Hou, Zhijie  wrote:
> In (/contrib/bloom/blutils.c:277), I found it use DatumGetInt32 to get UInt32 
> type.
> Is it more appropriate to use DatumGetUInt32 here?

Typically, the DatumGetBlah() function that you pick should match the
SQL data type that the function is returning. So if the function
returns pg_catalog.int4, which corresponds to the C data type int32,
you would use DatumGetInt32. There is no SQL type corresponding to the
C data type uint32, so I'm not sure why we even have DatumGetUInt32.
I'm sort of suspicious that there's some fuzzy thinking going on
there.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: pgindent vs dtrace on macos

2020-09-21 Thread Daniel Gustafsson
> On 21 Sep 2020, at 19:59, Tom Lane  wrote:
> 
> I wrote:
>> We still have to deal with
>> src/backend/utils/sort/qsort_tuple.c
>> src/pl/plpgsql/src/plerrcodes.h
>> src/pl/plpython/spiexceptions.h
>> src/pl/tcl/pltclerrcodes.h
>> if we want to be entirely clean about this.
> 
> I took care of those, so I think we're done here.

Aha, thanks!  They were still on my TODO but I got distracted by a shiny object
or two.  Thanks for fixing.

cheers ./daniel



Re: OpenSSL 3.0.0 compatibility

2020-09-21 Thread Daniel Gustafsson
> On 19 Sep 2020, at 04:11, Michael Paquier  wrote:
> On Fri, Sep 18, 2020 at 04:11:13PM +0200, Daniel Gustafsson wrote:

>> The other problem was that the cipher context
>> padding must be explicitly set, whereas in previous versions relying on the
>> default worked fine.  EVP_CIPHER_CTX_set_padding always returns 1 so thats 
>> why
>> it isn't checking the returnvalue as the other nearby initialization calls.
> 
> It seems to me that it would be a good idea to still check for the
> return value of EVP_CIPHER_CTX_set_padding() and just return with
> a PXE_CIPHER_INIT.  By looking at the upstream code, it is true that
> it always returns true for <= 1.1.1, but that's not the case for
> 3.0.0.  Some code paths of upstream also check after it.

That's a good point; it's now provider-dependent and can thus fail in case the
provider isn't supplying the functionality.  We've already loaded a provider
where we know the call is supported, but it's of course better to check.  Fixed
in the attached v2.
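
For reference, the shape of the added check is roughly the sketch below
(the surrounding init routine and the padding value passed are whatever the
attached patch uses; only the "check the return value and fail with
PXE_CIPHER_INIT" part is the point here):

#include <openssl/evp.h>
#include "px.h"             /* pgcrypto's PXE_* error codes */

static int
set_padding_checked(EVP_CIPHER_CTX *ctx, int padding)
{
    /* a provider may legitimately refuse; treat that as an init failure */
    if (EVP_CIPHER_CTX_set_padding(ctx, padding) != 1)
        return PXE_CIPHER_INIT;
    return 0;
}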

I was only reading the docs and not the code, and they haven't been updated to
reflect this.  I'll open a PR to the OpenSSL devs to fix that.

> Also, what's the impact with disabling the padding for <= 1.1.1?  This
> part of the upstream code is still a bit obscure to me..

I'm far from fluent in OpenSSL, but AFAICT it has to do with the new provider
API.  The default value for padding is unchanged, but it needs to be propagated
into the provider to be set in the context where the old code picked it up
automatically.  The relevant OpenSSL commit (83f68df32f0067ee7b0) which changes
the docs to say the function should be called doesn't contain enough
information to explain why.

>> To avoid problems with the by LibreSSL overloaded OPENSSL_VERSION_NUMBER 
>> macro
>> (which too is deprecated in 3.0.0), I used the new macro which is only set in
>> 3.0.0. Not sure if that's considered acceptable or if we should invent our 
>> own
>> version macro in autoconf.
> 
> OSSL_PROVIDER_load() is new as of 3.0.0, so using a configure switch
> similarly as what we do for the other functions should be more
> consistent and enough, no?

Good point, fixed in the attached.

>> For the main SSL tests, the incorrect password test has a new errormessage
>> which is added in 0002.
> 
> Hmm.  I am linking to a build of alpha6 here, but I still see the
> error being reported as a bad decrypt for this test.  Interesting. 

Turns out it was coming from when I tested against OpenSSL git HEAD, so it may
come in alpha7 (or whatever the next version will be). Let's disregard this
for now until the dust settles; I've dropped the patch from the series.

cheers ./daniel



0001-Fix-OpenSSL-3-support-in-pgcrypto-v2.patch
Description: Binary data


Re: recovering from "found xmin ... from before relfrozenxid ..."

2020-09-21 Thread Tom Lane
Robert Haas  writes:
> On Sun, Sep 20, 2020 at 1:13 PM Tom Lane  wrote:
>> 2. As Amit suspected, there's an inconsistency between pruneheap.c's
>> rules for which tuples are removable and vacuum.c's rules for that.
>> This seems like a massive bug in its own right: what's the point of
>> pruneheap.c going to huge effort to decide whether it should keep a
>> tuple if vacuum will then kill it anyway?  I do not understand why
>> whoever put in the GlobalVisState stuff only applied it in pruneheap.c
>> and not VACUUM proper.

> I am not sure I fully understand why you're contrasting pruneheap.c
> with vacuum here, because vacuum just does HOT pruning to remove dead
> tuples - maybe calling the relevant functions with different
> arguments, but it doesn't have its own independent logic for that.

Right, but what we end up with is that the very same tuple xmin and
xmax might result in pruning/deletion, or not, depending on whether
it's part of a HOT chain or not.  That's at best pretty weird, and
at worst it means that corner-case bugs in other places are triggered
in only one of the two scenarios ... which is what we have here.

> The key point is that the freezing code isn't, or at least
> historically wasn't, very smart about dead tuples. For example, I
> think if you told it to freeze something that was dead it would just
> do it, which is obviously bad. And that's why Andres stuck those
> sanity checks in there. But it's still pretty fragile. I think perhaps
> the pruning code should be rewritten in such a way that it can be
> combined with the code that freezes and marks pages all-visible, so
> that there's not so much action at a distance, but such an endeavor is
> in itself pretty scary, and certainly not back-patchable.

Not sure.  The pruning code is trying to serve two masters, that is
both VACUUM and on-the-fly cleanup during ordinary queries.  If you
try to merge it with other tasks that VACUUM does, you're going to
have a mess for the second usage.  I fear there's going to be pretty
strong conservation of cruft either way.

FWIW, weakening the sanity checks in heap_prepare_freeze_tuple is
*not* my preferred fix here.  But it'll take some work in other
places to preserve them.

regards, tom lane




Re: Improper use about DatumGetInt32

2020-09-21 Thread Tom Lane
Robert Haas  writes:
> Typically, the DatumGetBlah() function that you pick should match the
> SQL data type that the function is returning. So if the function
> returns pg_catalog.int4, which corresponds to the C data type int32,
> you would use DatumGetInt32. There is no SQL type corresponding to the
> C data type uint32, so I'm not sure why we even have DatumGetUInt32.

xid?

regards, tom lane




Re: pgindent vs dtrace on macos

2020-09-21 Thread Tom Lane
Oh wait, I forgot about the fmgrprotos.h discrepancy.

I wrote:
> It strikes me that a low-cost workaround would be to rename these
> C functions.  There's no law that their C names must match the
> SQL names.

Here's a proposed patch to fix it that way.

regards, tom lane

diff --git a/src/backend/utils/adt/numeric.c b/src/backend/utils/adt/numeric.c
index 27d65557df..2cc2da60eb 100644
--- a/src/backend/utils/adt/numeric.c
+++ b/src/backend/utils/adt/numeric.c
@@ -1109,7 +1109,7 @@ numeric_support(PG_FUNCTION_ARGS)
  *	scale of the attribute have to be applied on the value.
  */
 Datum
-numeric		(PG_FUNCTION_ARGS)
+numeric_apply_typmod(PG_FUNCTION_ARGS)
 {
 	Numeric		num = PG_GETARG_NUMERIC(0);
 	int32		typmod = PG_GETARG_INT32(1);
diff --git a/src/backend/utils/adt/oracle_compat.c b/src/backend/utils/adt/oracle_compat.c
index 76e666474e..79ba28671f 100644
--- a/src/backend/utils/adt/oracle_compat.c
+++ b/src/backend/utils/adt/oracle_compat.c
@@ -923,7 +923,7 @@ ascii(PG_FUNCTION_ARGS)
  /
 
 Datum
-chr			(PG_FUNCTION_ARGS)
+chr_code_to_text(PG_FUNCTION_ARGS)
 {
 	uint32		cvalue = PG_GETARG_UINT32(0);
 	text	   *result;
diff --git a/src/include/catalog/pg_proc.dat b/src/include/catalog/pg_proc.dat
index f48f5fb4d9..5eb293ee8f 100644
--- a/src/include/catalog/pg_proc.dat
+++ b/src/include/catalog/pg_proc.dat
@@ -3399,7 +3399,7 @@
   prosrc => 'ascii' },
 { oid => '1621', descr => 'convert int4 to char',
   proname => 'chr', prorettype => 'text', proargtypes => 'int4',
-  prosrc => 'chr' },
+  prosrc => 'chr_code_to_text' },
 { oid => '1622', descr => 'replicate string n times',
   proname => 'repeat', prorettype => 'text', proargtypes => 'text int4',
   prosrc => 'repeat' },
@@ -4210,7 +4210,8 @@
   proargtypes => 'internal', prosrc => 'numeric_support' },
 { oid => '1703', descr => 'adjust numeric to typmod precision/scale',
   proname => 'numeric', prosupport => 'numeric_support',
-  prorettype => 'numeric', proargtypes => 'numeric int4', prosrc => 'numeric' },
+  prorettype => 'numeric', proargtypes => 'numeric int4',
+  prosrc => 'numeric_apply_typmod' },
 { oid => '1704',
   proname => 'numeric_abs', prorettype => 'numeric', proargtypes => 'numeric',
   prosrc => 'numeric_abs' },


Re: pgindent vs dtrace on macos

2020-09-21 Thread Alvaro Herrera
On 2020-Sep-21, Tom Lane wrote:

> Oh wait, I forgot about the fmgrprotos.h discrepancy.
> 
> I wrote:
> > It strikes me that a low-cost workaround would be to rename these
> > C functions.  There's no law that their C names must match the
> > SQL names.
> 
> Here's a proposed patch to fix it that way.

pgtypes_numeric.h still contains

typedef struct
{
int ndigits;/* number of digits in digits[] - can be 0! */
int weight; /* weight of first digit */
int rscale; /* result scale */
int dscale; /* display scale */
int sign;   /* NUMERIC_POS, NUMERIC_NEG, or NUMERIC_NAN */
NumericDigit *buf;  /* start of alloc'd space for digits[] */
NumericDigit *digits;   /* decimal digits */
} numeric;

... isn't this more likely to create a typedef entry than merely a
function name?

-- 
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: pgindent vs dtrace on macos

2020-09-21 Thread Tom Lane
Alvaro Herrera  writes:
> On 2020-Sep-21, Tom Lane wrote:
>> Here's a proposed patch to fix it that way.

> pgtypes_numeric.h still contains
> typedef struct
> {
> } numeric;

> ... isn't this more likely to create a typedef entry than merely a
> function name?

Well, yeah, it *is* a typedef.  My proposal is to rename the C function
to avoid the conflict, rather than renaming the typedef.  Given the
small number of direct calls (none), that's a lot less work.  Also,
I think pgtypes_numeric.h is exposed to ecpg client code, so changing
that typedef's name could be quite problematic.

regards, tom lane




Re: pgindent vs dtrace on macos

2020-09-21 Thread Alvaro Herrera
On 2020-Sep-21, Tom Lane wrote:

> > ... isn't this more likely to create a typedef entry than merely a
> > function name?
> 
> Well, yeah, it *is* a typedef.  My proposal is to rename the C function
> to avoid the conflict, rather than renaming the typedef.  Given the
> small number of direct calls (none), that's a lot less work.  Also,
> I think pgtypes_numeric.h is exposed to ecpg client code, so changing
> that typedef's name could be quite problematic.

Ah, of course.

The idea of adding the names to pgindent's %blacklist results in severe
uglification, particularly in the regex code, so +1 for your workaround.

-- 
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: Improper use about DatumGetInt32

2020-09-21 Thread Andres Freund
Hi,

On 2020-09-21 14:08:22 -0400, Robert Haas wrote:
> There is no SQL type corresponding to the C data type uint32, so I'm
> not sure why we even have DatumGetUInt32.  I'm sort of suspicious that
> there's some fuzzy thinking going on there.

I think we mostly use it for the few places where we currently expose
data as a signed integer at the SQL level but internally treat it as
unsigned. There aren't a lot of those, but pg_class.relpages is one
example.  There may also be places where we use it for functions that
can be created but not called from SQL (using the INTERNAL type).

- Andres
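For illustration, a minimal sketch of that pattern; the helper function below
is hypothetical, but the Datum macros and syscache calls are existing APIs,
and pg_class.relpages is the int4-at-SQL-level, unsigned-in-C case mentioned
above.

#include "postgres.h"
#include "catalog/pg_class.h"
#include "utils/syscache.h"

/* Hypothetical helper: read an int4 catalog column as an unsigned value. */
static uint32
get_relpages_unsigned(Oid relid)
{
    HeapTuple   tup = SearchSysCache1(RELOID, ObjectIdGetDatum(relid));
    bool        isnull;
    uint32      relpages;

    if (!HeapTupleIsValid(tup))
        elog(ERROR, "cache lookup failed for relation %u", relid);

    relpages = DatumGetUInt32(SysCacheGetAttr(RELOID, tup,
                                              Anum_pg_class_relpages,
                                              &isnull));
    ReleaseSysCache(tup);
    return relpages;
}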




Re: WIP: BRIN multi-range indexes

2020-09-21 Thread Tomas Vondra

On Mon, Sep 21, 2020 at 01:42:34PM -0400, John Naylor wrote:

On Fri, Sep 18, 2020 at 6:27 PM Tomas Vondra
 wrote:


But maybe we could still use this scheme by actually computing

h1 = hash_uint32_extended(value, seed1);
h2 = hash_uint32_extended(value, seed2);

and then use this as the independent hash functions. I think that would
meet the requirements of the paper.


Yeah, that would work algorithmically. It would be trivial to add to
the patch, too, of course. There'd be a higher up-front CPU cost. Also,
I'm a bit cautious about rehashing hashes, and about whether the two
values above are independent enough. I'm not sure either of these points
matters. My guess is the partition approach is more sound, but it has
some minor organizational challenges (see below).



OK. I don't think rehashing hashes is an issue as long as the original
hash has a sufficiently low collision rate (and while we know it's not
perfect, we know it works well enough for hash indexes etc.). And I doubt
the cost of the extra hash of a uint32 would be noticeable.

That being said the partitioning approach might be more sound and it's
definitely worth giving it a try.
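For reference, a minimal sketch of the two-seed scheme, using the usual
g_i(x) = h1(x) + i * h2(x) construction to derive the k probe positions;
the seed constants and the bitmap layout are made up for illustration,
not taken from the patch:

#include "postgres.h"
#include "common/hashfn.h"

/* Set the k probe bits for "value" in a filter of nbits bits. */
static void
bloom_add_value(uint64 *bitmap, uint32 nbits, int nhashes, uint32 value)
{
    /* two hashes of the same value, derived with different seeds */
    uint64      h1 = hash_uint32_extended(value, UINT64CONST(0x71d67fff));
    uint64      h2 = hash_uint32_extended(value, UINT64CONST(0x9e3779b9));

    /* a real version would also guard against h2 % nbits == 0 */
    for (int i = 0; i < nhashes; i++)
    {
        uint64      bit = (h1 + (uint64) i * h2) % nbits;

        bitmap[bit / 64] |= UINT64CONST(1) << (bit % 64);
    }
}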


Why would 11k bits be more than we'd like to store? Assuming we could
use the whole 8kB page for the bloom filter, that'd be about 64k bits.
In practice there'd be a bit of overhead (page header ...) but it's
still much more than 11k bits.


Brain fade -- I guess I thought we were avoiding being toasted, but
now I see that's not possible for BRIN storage. So, we'll want to
guard against this:

ERROR:  index row size 8160 exceeds maximum 8152 for index "bar_num_idx"

While playing around with the numbers I had an epiphany: At the
defaults, the filter already takes up ~4.3kB, over half the page.
There is no room for another tuple, so if we're only indexing one
column, we might as well take up the whole page.


Hmm, yeah. I may be wrong but IIRC indexes don't support external
storage but compression is still allowed. So even if those defaults are
a bit higher than needed that should make the bloom filters a bit more
compressible, and thus fit multiple BRIN tuples on a single page.


Here MT = max tuples per 128 8k pages, or 37120, so default ndistinct
is 3712.

n  k  m  p
MT/10  7  35580  0.01
MT/10  7  64000  0.0005
MT/10  12 64000  0.00025

Assuming ndistinct isn't way off reality, we get 20x-40x lower false
positive rate almost for free, and it'd be trivial to code! Keeping k
at 7 would be more robust, since it's equivalent to starting with n =
~6000, p = 0.006, which is still almost 2x less false positives than
you asked for. It also means nearly doubling the number of sorted
values before switching.

Going the other direction, capping nbits to 64k bits when ndistinct
gets too high, the false positive rate we can actually support starts
to drop. Here, the user requested 0.001 fpr.

n   k  p
4500    9  0.001
6000    7  0.006
7500    6  0.017
15000   3  0.129  (probably useless by now)
MT  1  0.440
64000   1  0.63 (possible with > 128 pages per range)

I imagine smaller pages_per_range settings are going to be useful for
skinny tables (note to self -- test). Maybe we could provide a way for
the user to see that their combination of pages_per_range, false
positive rate, and ndistinct is supportable, like
brin_bloom_get_supported_fpr(). Or document to check with page_inspect.
And that's not even considering multi-column indexes, like you
mentioned.



I agree giving users visibility into this would be useful.

Not sure about how much we want to rely on these optimizations, though,
considering multi-column indexes kinda break this.


But I guess we can simply make the table of primes a bit larger,
right?


If we want to support all the above cases without falling down
entirely, it would have to go up to 32k to be safe (When k = 1 we could
degenerate to one modulus on the whole filter). That would be a table
of about 7kB, which we could binary search. [thinks for a
moment]...Actually, if we wanted to be clever, maybe we could
precalculate the primes needed for the 64k bit cases and stick them at
the end of the array. The usual algorithm will find them. That way, we
could keep the array around 2kB. However, for >8kB block size, we
couldn't adjust the 64k number, which might be okay, but not really
project policy.

We could also generate the primes via a sieve instead, which is really
fast (and more code). That would be more robust, but that would require
the filter to store the actual primes used, so 20 more bytes at max k =
10. We could hard-code space for that, or to keep from hard-coding
maximum k and thus lowest possible false positive rate, we'd need more
bookkeeping.



I don't think the efficiency of this code matters too much - it's only
used once when creating the index, so the simpler the better. Certainly
for now, while testing the partitioning approach.
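For reference, a throwaway Sieve of Eratosthenes is only a few lines of C;
a sketch, assuming the limit stays in the tens of thousands discussed above:

/*
 * Return the largest prime <= limit, using a plain Sieve of Eratosthenes.
 */
static int
largest_prime_at_most(int limit)
{
    bool       *composite = palloc0(sizeof(bool) * (limit + 1));
    int         result = 2;

    for (int p = 2; p <= limit; p++)
    {
        if (composite[p])
            continue;
        result = p;             /* p is prime */
        for (int q = 2 * p; q <= limit; q += p)
            composite[q] = true;
    }

    pfree(composite);
    return result;
}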


So, with the partition approach, we'd likely have to set in stone
either max nbits, or min targ

Re: Feature improvement for pg_stat_statements

2020-09-21 Thread Andres Freund
Hi,

On 2020-09-18 17:55:12 +0900, btnakamichin wrote:
> I’m thinking of adding a function called
> pg_stat_statements_reset_time() that returns the last timestamp at which
> pg_stat_statements_reset() was executed. pg_stat_statements can reset each SQL
> statement. We can record a reset timestamp for each such SQL-level reset, but
> I don’t feel the need. This time, I’m thinking of updating the reset timestamp
> only when all stats were reset.

What exactly do you mean by "can reset each SQL statement"? I don't
really see what alternative you're analyzing?

It does make sense to me to have a function returning the time of the
last reset.

Greetings,

Andres Freund




Re: recovering from "found xmin ... from before relfrozenxid ..."

2020-09-21 Thread Robert Haas
On Mon, Sep 21, 2020 at 2:21 PM Tom Lane  wrote:
> Right, but what we end up with is that the very same tuple xmin and
> xmax might result in pruning/deletion, or not, depending on whether
> it's part of a HOT chain or not.  That's at best pretty weird, and
> at worst it means that corner-case bugs in other places are triggered
> in only one of the two scenarios ... which is what we have here.

I'm not sure I really understand how that's happening, because surely
HOT chains and non-HOT chains are pruned by the same code, but it
doesn't sound good.

> FWIW, weakening the sanity checks in heap_prepare_freeze_tuple is
> *not* my preferred fix here.  But it'll take some work in other
> places to preserve them.

Make sense.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: recovering from "found xmin ... from before relfrozenxid ..."

2020-09-21 Thread Andres Freund
Hi,

On 2020-09-21 16:02:29 -0400, Robert Haas wrote:
> On Mon, Sep 21, 2020 at 2:21 PM Tom Lane  wrote:
> > Right, but what we end up with is that the very same tuple xmin and
> > xmax might result in pruning/deletion, or not, depending on whether
> > it's part of a HOT chain or not.  That's at best pretty weird, and
> > at worst it means that corner-case bugs in other places are triggered
> > in only one of the two scenarios ... which is what we have here.
> 
> I'm not sure I really understand how that's happening, because surely
> HOT chains and non-HOT chains are pruned by the same code, but it
> doesn't sound good.

Not necessarily, unfortunately:

case HEAPTUPLE_DEAD:

/*
 * Ordinarily, DEAD tuples would have been removed by
 * heap_page_prune(), but it's possible that the tuple
 * state changed since heap_page_prune() looked.  In
 * particular an INSERT_IN_PROGRESS tuple could have
 * changed to DEAD if the inserter aborted.  So this
 * cannot be considered an error condition.
 *
 * If the tuple is HOT-updated then it must only be
 * removed by a prune operation; so we keep it just as if
 * it were RECENTLY_DEAD.  Also, if it's a heap-only
 * tuple, we choose to keep it, because it'll be a lot
 * cheaper to get rid of it in the next pruning pass than
 * to treat it like an indexed tuple. Finally, if index
 * cleanup is disabled, the second heap pass will not
 * execute, and the tuple will not get removed, so we must
 * treat it like any other dead tuple that we choose to
 * keep.
 *
 * If this were to happen for a tuple that actually needed
 * to be deleted, we'd be in trouble, because it'd
 * possibly leave a tuple below the relation's xmin
 * horizon alive.  heap_prepare_freeze_tuple() is prepared
 * to detect that case and abort the transaction,
 * preventing corruption.
 */
if (HeapTupleIsHotUpdated(&tuple) ||
HeapTupleIsHeapOnly(&tuple) ||
params->index_cleanup == VACOPT_TERNARY_DISABLED)
nkeep += 1;
else
tupgone = true; /* we can delete the tuple */
all_visible = false;


So if e.g. a transaction aborts between the heap_page_prune and this
check the pruning behaviour depends on whether the tuple is part of a
HOT chain or not.

Greetings,

Andres Freund




Re: recovering from "found xmin ... from before relfrozenxid ..."

2020-09-21 Thread Tom Lane
Robert Haas  writes:
> On Mon, Sep 21, 2020 at 2:21 PM Tom Lane  wrote:
>> Right, but what we end up with is that the very same tuple xmin and
>> xmax might result in pruning/deletion, or not, depending on whether
>> it's part of a HOT chain or not.  That's at best pretty weird, and
>> at worst it means that corner-case bugs in other places are triggered
>> in only one of the two scenarios ... which is what we have here.

> I'm not sure I really understand how that's happening, because surely
> HOT chains and non-HOT chains are pruned by the same code, but it
> doesn't sound good.

No, they're not.  lazy_scan_heap() will never remove a tuple that
is HeapTupleIsHotUpdated or HeapTupleIsHeapOnly, even if it thinks
it's DEAD -- cf. vacuumlazy.c, about line 1350.  So tuples in
a HOT chain are deleted exactly when pruneheap.c sees fit to do so.
OTOH, for tuples not in a HOT chain, the decision is (I believe)
entirely on lazy_scan_heap().  And the core of my complaint is that
pruneheap.c's decisions about what is DEAD are not reliably identical
to what HeapTupleSatisfiesVacuum thinks.

I don't mind if a free-standing prune operation has its own rules,
but when it's invoked by VACUUM it ought to follow VACUUM's rules about
what is dead or alive.  What remains unclear at this point is whether
we ought to import some of the intelligence added by the GlobalVisState
patch into VACUUM's behavior, or just dumb down pruneheap.c so that
it applies exactly the HeapTupleSatisfiesVacuum rules when invoked
by VACUUM.

regards, tom lane




Re: recovering from "found xmin ... from before relfrozenxid ..."

2020-09-21 Thread Andres Freund
Hi,

On 2020-09-20 13:13:16 -0400, Tom Lane wrote:
> 2. As Amit suspected, there's an inconsistency between pruneheap.c's
> rules for which tuples are removable and vacuum.c's rules for that.
> This seems like a massive bug in its own right: what's the point of
> pruneheap.c going to huge effort to decide whether it should keep a
> tuple if vacuum will then kill it anyway?  I do not understand why
> whoever put in the GlobalVisState stuff only applied it in pruneheap.c
> and not VACUUM proper.

The reason for that is that the GlobalVisState stuff is computed
heuristically (and then re-checked if that's not sufficient to prune a
tuple, unless already done so). That's done so GetSnapshotData() doesn't
have to look at each backend's ->xmin, which is quite a massive speedup
at higher connection counts, as each backend's ->xmin changes much more
often than each backend's xid.

But for VACUUM we need to do the accurate scan of the procarray anyway,
because we need an accurate value for things other than HOT pruning
decisions.

What do you exactly mean with the "going to huge effort to decide" bit?


> I think to move forward, we need to figure out what the freezing
> behavior ought to be for temp tables.  We could make it the same
> as it was before a7212be8b, which'd just require some more complexity
> in vacuum_set_xid_limits.  However, that negates the idea that we'd
> like VACUUM's behavior on a temp table to be fully independent of
> whether concurrent transactions exist.  I'd prefer to allow a7212be8b's
> behavior to stand, but then it seems we need to lobotomize the error
> check in heap_prepare_freeze_tuple to some extent.

I think that's an argument for what I suggested elsewhere, which is that
we should move the logic for a different horizon for temp tables out of
vacuum_set_xid_limits, and into procarray.


> Independently of that, it seems like we need to fix things so that
> when pruneheap.c is called by vacuum, it makes EXACTLY the same
> dead-or-not-dead decisions that the main vacuum code makes.  This
> business with applying some GlobalVisState rule or other instead
> seems just as unsafe as can be.

It's not great, I agree. Not sure there is a super nice answer
though. Note that, even before my changes, vacuumlazy can decide
differently than pruning whether a tuple is live. E.g. when an inserting
transaction aborts. That's pretty much unavoidable as long as we have
multiple HTSV calls for a tuple, since none of our locking can (nor
should) prevent concurrent transactions from aborting.

Before your new code avoiding the GetOldestNonRemovableTransactionId()
call for temp tables, the GlobalVis* machinery could never be more
pessimistic than decisions based on a prior
GetOldestNonRemovableTransactionId() call (as that internally updates the
heuristic horizons used by GlobalVis).

Greetings,

Andres Freund




Re: recovering from "found xmin ... from before relfrozenxid ..."

2020-09-21 Thread Tom Lane
Andres Freund  writes:
> The reason for that is that the GlobalVisState stuff is computed
> heuristically (and then re-checked if that's not sufficient to prune a
> tuple, unless already done so). That's done so GetSnapshotData() doesn't
> have to look at each backends ->xmin, which is quite a massive speedup
> at higher connection counts, as each backends ->xmin changes much more
> often than each backend's xid.

OK.

> What do you exactly mean with the "going to huge effort to decide" bit?

I'd looked at all the complexity around GlobalVisState, but failed to
register that it should be pretty cheap on a per-tuple basis.  So never
mind that complaint.  The point that remains is just that it's different
from HeapTupleSatisfiesVacuum's rules.

>> I think to move forward, we need to figure out what the freezing
>> behavior ought to be for temp tables.  We could make it the same
>> as it was before a7212be8b, which'd just require some more complexity
>> in vacuum_set_xid_limits.  However, that negates the idea that we'd
>> like VACUUM's behavior on a temp table to be fully independent of
>> whether concurrent transactions exist.  I'd prefer to allow a7212be8b's
>> behavior to stand, but then it seems we need to lobotomize the error
>> check in heap_prepare_freeze_tuple to some extent.

> I think that's an argument for what I suggested elsewhere, which is that
> we should move the logic for a different horizon for temp tables out of
> vacuum_set_xid_limits, and into procarray.

But procarray does not seem like a great place for
table-persistence-dependent decisions either?

>> Independently of that, it seems like we need to fix things so that
>> when pruneheap.c is called by vacuum, it makes EXACTLY the same
>> dead-or-not-dead decisions that the main vacuum code makes.  This
>> business with applying some GlobalVisState rule or other instead
>> seems just as unsafe as can be.

> It's not great, I agree. Not sure there is a super nice answer
> though. Note that, even before my changes, vacuumlazy can decide
> differently than pruning whether a tuple is live. E.g. when an inserting
> transaction aborts. That's pretty much unavoidable as long as we have
> multiple HTSV calls for a tuple, since none of our locking can (nor
> should) prevent concurrent transactions from aborting.

It's clear that if the environment changes between test A and test B,
we might get different results.  What I'm not happy about is that the
rules are different, so we might get different results even if the
environment did not change.

regards, tom lane




Re: Parallel Full Hash Join

2020-09-21 Thread Melanie Plageman
On Wed, Sep 11, 2019 at 11:23 PM Thomas Munro 
wrote:

>
> While thinking about looping hash joins (an alternative strategy for
> limiting hash join memory usage currently being investigated by
> Melanie Plageman in a nearby thread[1]), the topic of parallel query
> deadlock hazards came back to haunt me.  I wanted to illustrate the
> problems I'm aware of with the concrete code where I ran into this
> stuff, so here is a new-but-still-broken implementation of $SUBJECT.
> This was removed from the original PHJ submission when I got stuck and
> ran out of time in the release cycle for 11.  Since the original
> discussion is buried in long threads and some of it was also a bit
> confused, here's a fresh description of the problems as I see them.
> Hopefully these thoughts might help Melanie's project move forward,
> because it's closely related, but I didn't want to dump another patch
> into that other thread.  Hence this new thread.
>
> I haven't succeeded in actually observing a deadlock with the attached
> patch (though I did last year, very rarely), but I also haven't tried
> very hard.  The patch seems to produce the right answers and is pretty
> scalable, so it's really frustrating not to be able to get it over the
> line.
>
> Tuple queue deadlock hazard:
>
> If the leader process is executing the subplan itself and waiting for
> all processes to arrive in ExecParallelHashEndProbe() (in this patch)
> while another process has filled up its tuple queue and is waiting for
> the leader to read some tuples and unblock it, they will deadlock
> forever.  That can't happen in the committed version of PHJ,
> because it never waits for barriers after it has begun emitting
> tuples.
>
> Some possible ways to fix this:
>
> 1.  You could probably make it so that the PHJ_BATCH_SCAN_INNER phase
> in this patch (the scan for unmatched tuples) is executed by only one
> process, using the "detach-and-see-if-you-were-last" trick.  Melanie
> proposed that for an equivalent problem in the looping hash join.  I
> think it probably works, but it gives up a lot of parallelism and thus
> won't scale as nicely as the attached patch.
>

I have attached a patch which implements this
(v1-0001-Parallel-FOJ-ROJ-single-worker-scan-buckets.patch).

For starters, in order to support parallel FOJ and ROJ, I re-enabled
setting the match bit for the tuples in the hashtable which
3e4818e9dd5be294d97c disabled. I did so using the code suggested in [1],
reading the match bit to see if it is already set before setting it.

Then, workers except for the last worker detach after exhausting the
outer side of a batch, leaving one worker to proceed to HJ_FILL_INNER
and do the scan of the hash table and emit unmatched inner tuples.
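Roughly, that flow could look like the sketch below. The helper name and its
return convention are assumptions made for illustration (returning true only
for the lone worker that stays attached); the patch may arrange this
differently:

    /*
     * All workers arrive here after exhausting the outer side of the batch.
     * Assume the helper detaches every participant except the last one to
     * arrive and returns true only for that survivor.
     */
    if (BarrierArriveAndDetachExceptLast(&batch->batch_barrier))
    {
        /* sole remaining participant: scan for unmatched inner tuples */
        node->hj_JoinState = HJ_FILL_INNER_TUPLES;
    }
    else
    {
        /* detached; the remaining worker will emit the unmatched tuples */
        node->hj_JoinState = HJ_NEED_NEW_BATCH;
    }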

I have also attached a variant on this patch which I am proposing to
replace it (v1-0001-Parallel-FOJ-ROJ-single-worker-scan-chunks.patch)
which has a new ExecParallelScanHashTableForUnmatched() in which the
single worker doing the unmatched scan scans one HashMemoryChunk at a
time and then frees them as it goes. I thought this might perform better
than the version which uses the buckets because 1) it should do a bit
less pointer chasing and 2) it frees each chunk of the hash table as it
scans it which (maybe) would save a bit of time during
ExecHashTableDetachBatch() when it goes through and frees the hash
table, but, my preliminary tests showed a negligible difference between
this and the version using buckets. I will do a bit more testing,
though.

I tried a few other variants of these patches, including one in which
the workers detach from the batch inside of the batch loading and
probing phase machine, ExecParallelHashJoinNewBatch(). This meant that
all workers transition to HJ_FILL_INNER and then HJ_NEED_NEW_BATCH in
order to detach in the batch phase machine. This, however, involved
adding a lot of new variables to distinguish whether or not the
unmatched outer scan was already done, whether or not the current worker
was the worker elected to do the scan, etc. Overall, it is probably
incorrect to use the HJ_NEED_NEW_BATCH state in this way. I had
originally tried this to avoid operating on the batch_barrier in the
main hash join state machine. I've found that the more places we add
code attaching to and detaching from the batch_barrier (and other PHJ
barriers, for that matter), the harder the code is to debug; however,
I think in this case it is required.


> 2.  You could probably make it so that only the leader process drops
> out of executing the inner unmatched scan, and then I think you
> wouldn't have this very specific problem at the cost of losing some
> (but not all) parallelism (ie the leader), but there might be other
> variants of the problem.  For example, a GatherMerge leader process
> might be blocked waiting for the next tuple for a tuple from P1, while
> P2 is try to write to a full queue, and P1 waits for P2.
>
> 3.  You could introduce some kind of overflow for tuple queues, so
> 

Re: recovering from "found xmin ... from before relfrozenxid ..."

2020-09-21 Thread Andres Freund
Hi,

On 2020-09-21 16:40:40 -0400, Tom Lane wrote:
> Andres Freund  writes:
> >> I think to move forward, we need to figure out what the freezing
> >> behavior ought to be for temp tables.  We could make it the same
> >> as it was before a7212be8b, which'd just require some more complexity
> >> in vacuum_set_xid_limits.  However, that negates the idea that we'd
> >> like VACUUM's behavior on a temp table to be fully independent of
> >> whether concurrent transactions exist.  I'd prefer to allow a7212be8b's
> >> behavior to stand, but then it seems we need to lobotomize the error
> >> check in heap_prepare_freeze_tuple to some extent.
> 
> > I think that's an argument for what I suggested elsewhere, which is that
> > we should move the logic for a different horizon for temp tables out of
> > vacuum_set_xid_limits, and into procarray.
> 
> But procarray does not seem like a great place for
> table-persistence-dependent decisions either?

That ship has sailed a long long time ago though. GetOldestXmin() has
looked at the passed in relation for a quite a while, and even before
that we had logic about 'allDbs' etc.  It doesn't easily seem possible
to avoid that, given how intimately that's coupled with how snapshots
are built and used, database & vacuumFlags checks etc.

Greetings,

Andres Freund




Re: recovering from "found xmin ... from before relfrozenxid ..."

2020-09-21 Thread Tom Lane
Andres Freund  writes:
> On 2020-09-21 16:40:40 -0400, Tom Lane wrote:
>> Andres Freund  writes:
>>> I think that's an argument for what I suggested elsewhere, which is that
>>> we should move the logic for a different horizon for temp tables out of
>>> vacuum_set_xid_limits, and into procarray.

>> But procarray does not seem like a great place for
>> table-persistence-dependent decisions either?

> That ship has sailed a long long time ago though. GetOldestXmin() has
> looked at the passed in relation for a quite a while, and even before
> that we had logic about 'allDbs' etc.  It doesn't easily seem possible
> to avoid that, given how intimately that's coupled with how snapshots
> are built and used, database & vacuumFlags checks etc.

OK.  Given that you've got strong feelings about this, do you want to
propose a patch?  I'm happy to fix it, since it's at least in part my
bug, but I probably won't do it exactly like you would.

regards, tom lane




Re: Handing off SLRU fsyncs to the checkpointer

2020-09-21 Thread Thomas Munro
On Mon, Sep 21, 2020 at 2:19 PM Thomas Munro  wrote:
> While scanning for comments and identifier names that needed updating,
> I realised that this patch changed the behaviour of the ShutdownXXX()
> functions, since they currently flush the SLRUs but are not followed
> by a checkpoint.  I'm not entirely sure I understand the logic of
> that, but it wasn't my intention to change it.  So here's a version
> that converts the existing fsync_fname() to fsync_fname_recurse() to

Bleugh, that was probably a bad idea, it's too expensive.  But it
forces me to ask the question: *why* do we need to call
Shutdown{CLOG,CommitTS,SUBTRANS, MultiXact}() after creating a
shutdown checkpoint?  I wondered if this might date from before the
WAL, but I see that the pattern was introduced when the CLOG was moved
out of shared buffers into a proto-SLRU in ancient commit 2589735da08,
but even in that commit the preceding CreateCheckPoint() call included
a call to CheckPointCLOG().




Re: new heapcheck contrib module

2020-09-21 Thread Robert Haas
On Tue, Aug 25, 2020 at 10:36 AM Mark Dilger
 wrote:
> Thanks for the review!

+ msg OUT text
+ )

Looks like atypical formatting.

+REVOKE ALL ON FUNCTION
+verify_heapam(regclass, boolean, boolean, cstring, bigint, bigint)
+FROM PUBLIC;

This too.

+-- Don't want this to be available to public

Add "by default, but superusers can grant access" or so?

I think there should be a call to pg_class_aclcheck() here, just like
the one in pg_prewarm, so that if the superuser does choose to grant
access, users given access can check tables they anyway have
permission to access, but not others. Maybe put that in
check_relation_relkind_and_relam() and rename it. Might want to look
at the pg_surgery precedent, too. Oh, and that function's header
comment is also wrong.
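Concretely, something along these lines should do it (a sketch modelled on
pg_prewarm's check; "rel" stands for the relation verify_heapam() has
already opened):

    AclResult   aclresult;

    /* Allow roles that could read the relation anyway. */
    aclresult = pg_class_aclcheck(RelationGetRelid(rel), GetUserId(),
                                  ACL_SELECT);
    if (aclresult != ACLCHECK_OK)
        aclcheck_error(aclresult, OBJECT_TABLE,
                       RelationGetRelationName(rel));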

I think that the way the checks on the block range are performed could
be improved. Generally, we want to avoid reporting the same problem
with a variety of different message strings, because it adds burden
for translators and is potentially confusing for users. You've got two
message strings that are only going to be used for empty relations and
a third message string that is only going to be used for non-empty
relations. What stops you from just ripping off the way that this is
done in pg_prewarm, which requires only 2 messages? Then you'd be
adding a net total of 0 new messages instead of 3, and in my view they
would be clearer than your third message, "block range is out of
bounds for relation with block count %u: " INT64_FORMAT " .. "
INT64_FORMAT, which doesn't say very precisely what the problem is,
and also falls afoul of our usual practice of avoiding the use of
INT64_FORMAT in error messages that are subject to translation. I
notice that pg_prewarm just silently does nothing if the start and end
blocks are swapped, rather than generating an error. We could choose
to do differently here, but I'm not sure why we should bother.

+   all_frozen = mapbits & VISIBILITYMAP_ALL_VISIBLE;
+   all_visible = mapbits & VISIBILITYMAP_ALL_FROZEN;
+
+   if ((all_frozen && skip_option ==
SKIP_PAGES_ALL_FROZEN) ||
+   (all_visible && skip_option ==
SKIP_PAGES_ALL_VISIBLE))
+   {
+   continue;
+   }

This isn't horrible style, but why not just get rid of the local
variables? e.g. if (skip_option == SKIP_PAGES_ALL_FROZEN) { if
((mapbits & VISIBILITYMAP_ALL_FROZEN) != 0) continue; } else { ... }

Typically no braces around a block containing only one line.

+ * table contains corrupt all frozen bits, a concurrent vacuum might skip the

all-frozen?

+ * relfrozenxid beyond xid.) Reporting the xid as valid under such conditions
+ * seems acceptable, since if we had checked it earlier in our scan it would
+ * have truly been valid at that time, and we break no MVCC guarantees by
+ * failing to notice the concurrent change in its status.

I agree with the first half of this sentence, but I don't know what
MVCC guarantees have to do with anything. I'd just delete the second
part, or make it a lot clearer.

+ * Some kinds of tuple header corruption make it unsafe to check the tuple
+ * attributes, for example when the tuple is foreshortened and such checks
+ * would read beyond the end of the line pointer (and perhaps the page).  In

I think of foreshortening mostly as an art term, though I guess it has
other meanings. Maybe it would be clearer to say something like "Some
kinds of corruption make it unsafe to check the tuple attributes, for
example when the line pointer refers to a range of bytes outside the
page"?

+ * Other kinds of tuple header corruption do not bare on the question of

bear

+ pstrdup(_("updating
transaction ID marked incompatibly as keys updated and locked
only")));
+ pstrdup(_("updating
transaction ID marked incompatibly as committed and as a
multitransaction ID")));

"updating transaction ID" might scare somebody who thinks that you are
telling them that you changed something. That's not what it means, but
it might not be totally clear. Maybe:

tuple is marked as only locked, but also claims key columns were updated
multixact should not be marked committed

+
psprintf(_("data offset differs from expected: %u vs. %u (1 attribute,
has nulls)"),

For these, how about:

tuple data should begin at byte %u, but actually begins at byte %u (1
attribute, has nulls)
etc.

+
psprintf(_("old-style VACUUM FULL transaction ID is in the future:
%u"),
+
psprintf(_("old-style VACUUM FULL transaction ID precedes freeze
threshold: %u"),
+
psprintf(_("old-style VACUUM FULL transaction ID is invalid in this
relation: %u"),

old-style VACUUM FULL transaction ID %u is in the future
old-style VACUUM FULL transact

Re: Command statistics system (cmdstats)

2020-09-21 Thread Robert Haas
On Mon, Sep 21, 2020 at 9:41 AM Alvaro Herrera  wrote:
> "A couple of weeks" of inactivity is not sufficient, in my view, to boot
> a patch out of the commitfest process.  Whenever the patch is
> resurrected, it will be a new entry which won't have the history that it
> had accumulated in the long time since it was created -- which biases
> it against other new patches.

As Michael said, it had been inactive for three *months*. I don't
think there's a big problem here. I think that the redesign of the
CommitFest application encourages carrying things along from CF to CF
forever, instead of getting rid of things that aren't wanted or aren't
making any progress. That's work we don't need. There ought to be a
way to mark a patch RwF when nothing's happening, in a way that lets it
be revived later if and when someone gets around to resubmitting, but I think
that's not possible now.

Then, too, since we have email integration now, maybe we ought to do
that automatically if the thread isn't getting updated and the patch
is sitting there waiting on author. It's a real waste of CFM time to
chase down and kick out obviously-inactive patches, and if the CFM is
going to get flack for it then that's even worse. Like, do we have 170
patches now because we have more activity than a few years ago, or
just because we've become more reluctant to boot things? I don't know,
but to the extent it's from unwillingness to boot things, I think
that's bad. Discouraging contributors is not good, but wasting CFM and
commiter time is also bad. You've got to draw a line someplace or
you'll go nuts.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: recovering from "found xmin ... from before relfrozenxid ..."

2020-09-21 Thread Andres Freund
On 2020-09-21 17:03:53 -0400, Tom Lane wrote:
> Andres Freund  writes:
> > On 2020-09-21 16:40:40 -0400, Tom Lane wrote:
> >> Andres Freund  writes:
> >>> I think that's an argument for what I suggested elsewhere, which is that
> >>> we should move the logic for a different horizon for temp tables out of
> >>> vacuum_set_xid_limits, and into procarray.
> 
> >> But procarray does not seem like a great place for
> >> table-persistence-dependent decisions either?
> 
> > That ship has sailed a long long time ago though. GetOldestXmin() has
> > looked at the passed in relation for a quite a while, and even before
> > that we had logic about 'allDbs' etc.  It doesn't easily seem possible
> > to avoid that, given how intimately that's coupled with how snapshots
> > are built and used, database & vacuumFlags checks etc.
> 
> OK.  Given that you've got strong feelings about this, do you want to
> propose a patch?  I'm happy to fix it, since it's at least in part my
> bug, but I probably won't do it exactly like you would.

I can give it a try. I can see several paths of varying invasiveness,
not sure yet what the best approach is. Let me think about it for a bit.

Greetings,

Andres Freund




Re: Compatible defaults for LEAD/LAG

2020-09-21 Thread Tom Lane
Pavel Stehule  writes:
> I see few possibilities how to finish and close this issue:
> 1. use anycompatible type and add note to documentation so expression of
> optional argument can change a result type, and so this is Postgres
> specific - other databases and ANSI SQL disallow this.
> It is the most simple solution, and it is good enough for this case, that
> is not extra important.
> 2. choose the correct type somewhere inside the parser - for these two
> functions.
> 3. introduce and implement some generic solution for this case - it can be
> implemented on C level via some function helper or on syntax level
>CREATE OR REPLACE FUNCTION lag(a anyelement, offset int, default defval
> a%type) ...
> "defval argname%type" means for caller's expression "CAST(defval AS
> typeof(argname))"

I continue to feel that the spec's definition of this is not so
obviously right that we should jump through hoops to duplicate it.
In fact, I don't even agree that we need a disclaimer in the docs
saying that it's not quite the same.  Only the most nitpicky
language lawyers would ever even notice.

In hopes of moving this along a bit, I experimented with converting
the other functions I listed to use anycompatible.  I soon found that
touching any functions/operators that are listed in operator classes
would open a can of worms far larger than I'd previously supposed.
To maintain consistency, we'd have to propagate the datatype changes
to *every* function/operator associated with those opfamilies --- many
of which only have one any-foo input and thus aren't on my original
list.  (An example here is that extending btree array_ops will require
changing the max(anyarray) and min(anyarray) aggregates too.)  We'd
then end up with a situation that would be rather confusing from a
user's standpoint, in that it'd be quite un-obvious why some array
functions use anyarray while other ones use anycompatiblearray.

So I backed off to just changing the functions/operators that have
no opclass connections, such as array_cat.  Even that has some
downsides --- for example, in the attached patch, it's necessary
to change some polymorphism.sql examples that explicitly reference
array_cat(anyarray).  I wonder whether this change would break a
significant number of user-defined aggregates or operators.

(Note that I think we'd have to resist the temptation to fix that
by letting CREATE AGGREGATE et al accept support functions that
take anyarray/anycompatiblearray (etc) interchangeably.  A lot of
the security analysis that went into CVE-2020-14350 depended on
the assumption that utility commands only do exact lookups of
support functions.  If we tried to be lax about this, we'd
re-introduce the possibility for hostile capture of function
references in extension scripts.)

Another interesting issue, not seen in the attached but which
came up while I was experimenting with the more aggressive patch,
was this failure in the polymorphism test:

 select max(histogram_bounds) from pg_stats where tablename = 'pg_am';
-ERROR:  cannot compare arrays of different element types
+ERROR:  function max(anyarray) does not exist

That's because we don't accept pg_stats.histogram_bounds (of
declared type anyarray) as a valid input for anycompatiblearray.
I wonder if that isn't a bug we need to fix in the anycompatible
patch, independently of anything else.  We'd not be able to deduce
an actual element type from such an input, but we already cannot
do so when we match it to an anyarray parameter.

Anyway, attached find

0001 - updates Vik's original patch to HEAD and tweaks the
grammar in the docs a bit.

0002 - add-on patch to convert array_append, array_prepend,
array_cat, array_position, array_positions, array_remove,
array_replace, and width_bucket to use anycompatiblearray.

I think 0001 is committable, but 0002 is just WIP since
I didn't touch the docs.  I'm slightly discouraged about
whether 0002 is worth proceeding with.  Any thoughts?

regards, tom lane

diff --git a/doc/src/sgml/func.sgml b/doc/src/sgml/func.sgml
index 461b748d89..14cb8e2ce6 100644
--- a/doc/src/sgml/func.sgml
+++ b/doc/src/sgml/func.sgml
@@ -19621,17 +19621,17 @@ SELECT count(*) FROM sometable;
 
  lag
 
-lag ( value anyelement
+lag ( value anycompatible
   , offset integer
-  , default anyelement  )
-anyelement
+  , default anycompatible  )
+anycompatible


 Returns value evaluated at
 the row that is offset
 rows before the current row within the partition; if there is no such
 row, instead returns default
-(which must be of the same type as
+(which must be of a type compatible with
 value).
 Both offset and
 default are evaluated
@@ -19646,17 +19646,17 @@ SELECT count(*) FROM sometable;
 
  lead
 
-lead ( value anyelement
+lead ( value anycompatible

RE: Global snapshots

2020-09-21 Thread tsunakawa.ta...@fujitsu.com
Hi Andrey-san, all,

From: Andrey V. Lepikhov 
> On 7/27/20 11:22 AM, tsunakawa.ta...@fujitsu.com wrote:
> > Could you take a look at this patent?  I'm afraid this is the Clock-SI for 
> > MVCC.
> Microsoft holds this until 2031.  I couldn't find this with the keyword
> "Clock-SI.""
> >
> >
> > US8356007B2 - Distributed transaction management for database systems
> with multiversioning - Google Patents
> > https://patents.google.com/patent/US8356007
> >
> >
> > If it is, can we circumvent this patent?

> I haven't seen this patent before. This should be carefully studied.


I contacted 6 people individually: 3 holders of the patent and 3 authors of 
the Clock-SI paper.  I got replies from two of them.  (Regrettably, I could 
not get a reply from the main author of the Clock-SI paper.)

[Reply from the patent holder Per-Ake Larson]
--
Thanks for your interest in my patent. 

The answer to your question is: No, Clock-SI is not based on the patent - it 
was an entirely independent development. The two approaches are similar in the 
sense that there is no global clock, the commit time of a distributed 
transaction is the same in every partition where it modified data, and a 
transaction gets it snapshot timestamp from a local clock. The difference is 
whether a distributed transaction gets its commit timestamp before or after the 
prepare phase in 2PC.

Hope this helpful.

Best regards,
Per-Ake
--


[Reply from the Clock-SI author Willy Zwaenepoel]
--
Thank you for your kind words about our work.

I was unaware of this patent at the time I wrote the paper. The two came out 
more or less at the same time.

I am not a lawyer, so I cannot tell you if something based on Clock-SI would 
infringe on the Microsoft patent. The main distinction to me seems to be that 
Clock-SI is based on physical clocks, while the Microsoft patent talks about 
logical clocks, but again I am not a lawyer.

Best regards,

Willy.
--


Does this make sense from your viewpoint, and can we conclude that we can use 
Clock-SI without infringing on the patent?  According to the patent holder, 
the differences between Clock-SI and the patent seem to be fewer than the 
similarities.


Regards
Takayuki Tsunakawa



Re: Index Skip Scan (new UniqueKeys)

2020-09-21 Thread Peter Geoghegan
On Sat, Aug 15, 2020 at 7:09 AM Dmitry Dolgov <9erthali...@gmail.com> wrote:
> Here is a new version that hopefully address most of the concerns
> mentioned in this thread so far. As before, first two patches are taken
> from UniqueKeys thread and attached only for the reference. List of
> changes includes:

Some thoughts on this version of the patch series (I'm focussing on
v36-0005-Btree-implementation-of-skipping.patch again):

* I see the following compiler warning:

/code/postgresql/patch/build/../source/src/backend/optimizer/path/uniquekeys.c:
In function ‘populate_baserel_uniquekeys’:
/code/postgresql/patch/build/../source/src/backend/optimizer/path/uniquekeys.c:797:13:
warning: ‘expr’ may be used uninitialized in this function
[-Wmaybe-uninitialized]
  797 |   else if (!list_member(unique_index->rel->reltarget->exprs, expr))
  | ^~

* Perhaps the warning is related to this nearby code that I noticed
Valgrind complains about:

==1083468== VALGRINDERROR-BEGIN
==1083468== Invalid read of size 4
==1083468==at 0x59568A: get_exprs_from_uniqueindex (uniquekeys.c:771)
==1083468==by 0x593C5B: populate_baserel_uniquekeys (uniquekeys.c:140)
==1083468==by 0x56AEA5: set_plain_rel_size (allpaths.c:586)
==1083468==by 0x56AADB: set_rel_size (allpaths.c:412)
==1083468==by 0x56A8CD: set_base_rel_sizes (allpaths.c:323)
==1083468==by 0x56A5A7: make_one_rel (allpaths.c:185)
==1083468==by 0x5AB426: query_planner (planmain.c:269)
==1083468==by 0x5AF02C: grouping_planner (planner.c:2058)
==1083468==by 0x5AD202: subquery_planner (planner.c:1015)
==1083468==by 0x5ABABF: standard_planner (planner.c:405)
==1083468==by 0x5AB7F8: planner (planner.c:275)
==1083468==by 0x6E6F84: pg_plan_query (postgres.c:875)
==1083468==by 0x6E70C4: pg_plan_queries (postgres.c:966)
==1083468==by 0x6E7497: exec_simple_query (postgres.c:1158)
==1083468==by 0x6EBCD3: PostgresMain (postgres.c:4309)
==1083468==by 0x624284: BackendRun (postmaster.c:4541)
==1083468==by 0x623995: BackendStartup (postmaster.c:4225)
==1083468==by 0x61FB70: ServerLoop (postmaster.c:1742)
==1083468==by 0x61F309: PostmasterMain (postmaster.c:1415)
==1083468==by 0x514AF2: main (main.c:209)
==1083468==  Address 0x75f13e0 is 4,448 bytes inside a block of size
8,192 alloc'd
==1083468==at 0x483B7F3: malloc (in
/usr/lib/x86_64-linux-gnu/valgrind/vgpreload_memcheck-amd64-linux.so)
==1083468==by 0x8C15C8: AllocSetAlloc (aset.c:919)
==1083468==by 0x8CEA52: palloc (mcxt.c:964)
==1083468==by 0x267F25: systable_beginscan (genam.c:373)
==1083468==by 0x8682CE: SearchCatCacheMiss (catcache.c:1359)
==1083468==by 0x868167: SearchCatCacheInternal (catcache.c:1299)
==1083468==by 0x867E2C: SearchCatCache1 (catcache.c:1167)
==1083468==by 0x8860B2: SearchSysCache1 (syscache.c:1123)
==1083468==by 0x8BD482: check_enable_rls (rls.c:66)
==1083468==by 0x68A113: get_row_security_policies (rowsecurity.c:134)
==1083468==by 0x683C2C: fireRIRrules (rewriteHandler.c:2045)
==1083468==by 0x687340: QueryRewrite (rewriteHandler.c:3962)
==1083468==by 0x6E6EB1: pg_rewrite_query (postgres.c:784)
==1083468==by 0x6E6D23: pg_analyze_and_rewrite (postgres.c:700)
==1083468==by 0x6E7476: exec_simple_query (postgres.c:1155)
==1083468==by 0x6EBCD3: PostgresMain (postgres.c:4309)
==1083468==by 0x624284: BackendRun (postmaster.c:4541)
==1083468==by 0x623995: BackendStartup (postmaster.c:4225)
==1083468==by 0x61FB70: ServerLoop (postmaster.c:1742)
==1083468==by 0x61F309: PostmasterMain (postmaster.c:1415)
==1083468==
==1083468== VALGRINDERROR-END

(You'll see the same error if you run Postgres Valgrind + "make
installcheck", though I don't think that the queries in question are
tests that you yourself wrote.)

* IndexScanDescData.xs_itup comments could stand to be updated here --
IndexScanDescData.xs_want_itup is no longer just about index-only
scans.

* Do we really need the AM-level boolean flag/argument named
"scanstart"? Why not just follow the example of btgettuple(), which
determines whether or not the scan has been initialized based on the
current scan position?

Just because you set so->currPos.buf to InvalidBuffer doesn't mean you
cannot or should not take the same approach as btgettuple(). And even
if you can't take exactly the same approach, I would still think that
the scan's opaque B-Tree state should remember if it's the first call
to _bt_skip() (rather than some subsequent call) in some other way
(e.g. carrying a "scanstart" bool flag directly).
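For reference, the relevant part of btgettuple() is roughly this, so
presumably _bt_skip() could key off the scan position in the same way:

    /* Is this the first call, i.e. has the scan been positioned yet? */
    if (!BTScanPosIsValid(so->currPos))
        res = _bt_first(scan, dir);     /* find the starting position */
    else
        res = _bt_next(scan, dir);      /* continue from current position */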

A part of my objection to "scanstart" is that it seems to require that
much of the code within _bt_skip() get another level of
indentation...which makes it even more difficult to follow.

* I don't understand what _bt_scankey_within_page() comments mean when
they refer to "the page highkey". It looks like this function examines
the highest data 

RE: Transactions involving multiple postgres foreign servers, take 2

2020-09-21 Thread tsunakawa.ta...@fujitsu.com
From: Masahiko Sawada 
> Yes, but it still seems hard to me that we require for all FDW
> implementations to commit/rollback prepared transactions without the
> possibility of ERROR.

Of course we can't eliminate the possibility of error, because remote servers 
require network communication.  What I'm saying is to just require the FDW to 
return an error, like xa_commit(), rather than throwing control away with 
ereport(ERROR).  I don't think that's too strict.
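To make that concrete, here is a purely hypothetical shape for such a
callback; the type name, arguments and status codes are illustrative
assumptions, not an existing API:

/*
 * Hypothetical FDW callback: commit a transaction previously prepared on the
 * foreign server.  Failure is reported through the return value, in the style
 * of xa_commit(), rather than via ereport(ERROR), so the caller keeps control
 * and can retry or record the failure.
 */
typedef int (*CommitForeignPrepared_function) (ForeignServer *server,
                                               UserMapping *user,
                                               const char *prepared_id);

#define FDWXA_OK        0       /* prepared transaction committed */
#define FDWXA_RETRY     1       /* transient failure, e.g. network error */
#define FDWXA_FATAL     2       /* prepared transaction unknown or lost */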


> I think it's not necessarily that all FDW implementations need to be
> able to support xa_complete(). We can support both synchronous and
> asynchronous executions of prepare/commit/rollback.

Yes, I think parallel prepare and commit can be an option for FDW.  But I don't 
think it's an option for a serious scale-out DBMS.  If we want to use FDW as 
part of PostgreSQL's scale-out infrastructure, we should design (if not 
implemented in the first version) how the parallelism can be realized.  That 
design is also necessary because it could affect the FDW API.


> If you're concerned that executing a UDF function by like 'SELECT
> myfunc();' updates data on a foreign server, since the UDF should know
> which foreign server it modifies data on it should be able to register
> the foreign server and mark as modified. Or you’re concerned that a
> UDF function in WHERE condition is pushed down and updates data (e.g.,
>  ‘SELECT … FROM foreign_tbl WHERE id = myfunc()’)?

What I had in mind is "SELECT myfunc(...) FROM mytable WHERE col = ...;"  Does 
the UDF call get pushed down to the foreign server in this case?  If not now, 
could it be pushed down in the future?  If it could be, it's worth considering 
how to detect the remote update now.


Regards
Takayuki Tsunakawa



Load TIME fields - proposed performance improvement

2020-09-21 Thread Peter Smith
Hi Hackers.

I have a test table with multiple (10) columns defined as TIME WITHOUT
TIME ZONE.

When loading this table with a lot of data (e.g. "COPY tbl FROM
/my/path/2GB.csv WITH (FORMAT CSV)") I observed it was spending an
excessive amount of time within the function GetCurrentDateTime.

IIUC the code is calling GetCurrentDateTime only to acquire the
current TX timestamp as a struct pg_tm in order to derive some
timezone information.

My test table has 10 x TIME columns.
My test data has 22.5 million rows (~ 2GB)
So that's 225 million calls to the GetCurrentDateTime function, all
populating the struct with the same values.

I have attached a patch which caches this struct, so now those 225
million calls are reduced to just 1 call.

~

Test Results:

Copy 22.5 million rows data (~ 2GB)

BEFORE
Run 1 = 4m 36s
Run 2 = 4m 30s
Run 3 = 4m 32s
perf showed 20.95% time in GetCurrentDateTime

AFTER (cached struct)
Run 1 = 3m 44s
Run 2 = 3m 44s
Run 3 = 3m 45s
perf shows no time in GetCurrentDateTime
~17% performance improvement in my environment. YMMV.

~

Thoughts?

Kind Regards
Peter Smith.
Fujitsu Australia.


DecodeTimeOnly_GetCurrentDateTime.patch
Description: Binary data


Re: Load TIME fields - proposed performance improvement

2020-09-21 Thread Tom Lane
Peter Smith  writes:
> IIUC the code is calling GetCurrentDateTime only to acquire the
> current TX timestamp as a struct pg_tm in order to derive some
> timezone information.
> ...
> I have attached a patch which caches this struct, so now those 225
> million calls are reduced to just 1 call.

Interesting idea, but this implementation is leaving a *lot*
on the table.  If we want to cache the result of
timestamp2tm applied to GetCurrentTransactionStartTimestamp(),
there are half a dozen different call sites that could make
use of such a cache, eg, GetSQLCurrentDate and GetSQLCurrentTime.
Applying the caching only in one indirect caller of that seems
pretty narrow-minded.

I'm also strongly suspecting that this implementation is outright
broken, since it's trying to make DecodeTimeOnly's local variable
"tt" into cache storage ... but that variable could be overwritten
with other values, during calls that take the other code path there.
The cache ought to be inside GetCurrentDateTime or something it
calls, and the value needs to be copied to the given output variable.

regards, tom lane




Re: Load TIME fields - proposed performance improvement

2020-09-21 Thread Tom Lane
I wrote:
> Interesting idea, but this implementation is leaving a *lot*
> on the table.  If we want to cache the result of
> timestamp2tm applied to GetCurrentTransactionStartTimestamp(),
> there are half a dozen different call sites that could make
> use of such a cache, eg, GetSQLCurrentDate and GetSQLCurrentTime.

As an example, I did some quick-and-dirty "perf" measurement of
this case:

create table t1 (id int, d date default current_date);
insert into t1 select generate_series(1,1);

and found that about 10% of the runtime is spent inside timestamp2tm().
Essentially all of that cost could be removed by a suitable caching
patch.  Admittedly, this is a pretty well cherry-picked example, and
more realistic test scenarios might see just a percent or two win.
Still, for the size of the patch I'm envisioning, it'd be well
worth the trouble.

regards, tom lane
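A sketch of the kind of cache being suggested, keyed on the transaction start
timestamp (the placement inside GetCurrentDateTime() is an assumption, and a
real version would also have to invalidate the cache when the TimeZone setting
changes, since the result depends on it):

static TimestampTz cache_ts;
static bool cache_valid = false;
static struct pg_tm cache_tm;

void
GetCurrentDateTime(struct pg_tm *tm)
{
    TimestampTz ts = GetCurrentTransactionStartTimestamp();

    if (!cache_valid || ts != cache_ts)
    {
        int         tz;
        fsec_t      fsec;

        if (timestamp2tm(ts, &tz, &cache_tm, &fsec, NULL, NULL) != 0)
            ereport(ERROR,
                    (errcode(ERRCODE_DATETIME_VALUE_OUT_OF_RANGE),
                     errmsg("timestamp out of range")));
        cache_ts = ts;
        cache_valid = true;
    }
    *tm = cache_tm;             /* copy out, so callers can't clobber the cache */
}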




Re: shared-memory based stats collector

2020-09-21 Thread Andres Freund
Hi,

On 2020-09-08 17:55:57 +0900, Kyotaro Horiguchi wrote:
> Locks on the shared statistics is acquired by the units of such like
> tables, functions so the expected chance of collision are not so high.

I can't really parse that...


> Furthermore, until 1 second has elapsed since the last flushing to
> shared stats, lock failure postpones stats flushing so that lock
> contention doesn't slow down transactions.

I think I commented on that before, but to me 1s seems way too low to
switch to blocking lock acquisition. What's the reason for such a low
limit?

>
>   /*
> -  * Clean up any dead statistics collector entries for this DB. We always
> +  * Clean up any dead activity statistics entries for this DB. We always
>* want to do this exactly once per DB-processing cycle, even if we find
>* nothing worth vacuuming in the database.
>*/

What is "activity statistics"?


> @@ -2816,8 +2774,8 @@ table_recheck_autovac(Oid relid, HTAB *table_toast_map,
>   }
>
>   /* fetch the pgstat table entry */
> - tabentry = get_pgstat_tabentry_relid(relid, classForm->relisshared,
> - 
>  shared, dbentry);
> + tabentry = pgstat_fetch_stat_tabentry_snapshot(classForm->relisshared,
> + 
>relid);

Why do all of these places deal with a snapshot? For most it seems to
make much more sense to just look up the entry and then copy that into
local memory?  There may be some place that need some sort of snapshot
behaviour that's stable until commit / pgstat_clear_snapshot(). But I
can't reallly see many?


> +#define PGSTAT_MIN_INTERVAL  1000/* Minimum interval of 
> stats data
>
> +#define PGSTAT_MAX_INTERVAL  1   /* Longest interval of 
> stats data
> + 
>  * updates */

These don't really seem to be in line with the commit message...




>  /*
> - * Structures in which backends store per-table info that's waiting to be
> - * sent to the collector.
> + * Enums and types to define shared statistics structure.
> + *
> + * Statistics entries for each object is stored in individual DSA-allocated
> + * memory. Every entry is pointed from the dshash pgStatSharedHash via
> + * dsa_pointer. The structure makes object-stats entries not moved by dshash
> + * resizing, and allows the dshash can release lock sooner on stats
> + * updates. Also it reduces interfering among write-locks on each stat entry 
> by
> + * not relying on partition lock of dshash. PgStatLocalHashEntry is the
> + * local-stats equivalent of PgStatHashEntry for shared stat entries.
> + *
> + * Each stat entry is enveloped by the type PgStatEnvelope, which stores 
> common
> + * attribute of all kind of statistics and a LWLock lock object.
> + *
> + * Shared stats are stored as:
> + *
> + * dshash pgStatSharedHash
> + *-> PgStatHashEntry (dshash entry)
> + *  (dsa_pointer)-> PgStatEnvelope   (dsa memory block)

I don't like 'Envelope' that much. If I understand you correctly that's
a common prefix that's used for all types of stat objects, correct? If
so, how about just naming it PgStatEntryBase or such? I think it'd also
be useful to indicate in the "are stored as" part that PgStatEnvelope is
just the common prefix for an allocation.


>  /*
> - * pgStatTabHash entry: map from relation OID to PgStat_TableStatus pointer
> + * entry size lookup table of shared statistics entries corresponding to
> + * PgStatTypes
>   */
> -typedef struct TabStatHashEntry
> +static size_t pgstat_entsize[] =

> +/* Ditto for local statistics entries */
> +static size_t pgstat_localentsize[] =
> +{
> + 0,  /* 
> PGSTAT_TYPE_ALL: not an entry */
> + sizeof(PgStat_StatDBEntry), /* PGSTAT_TYPE_DB */
> + sizeof(PgStat_TableStatus), /* PGSTAT_TYPE_TABLE */
> + sizeof(PgStat_BackendFunctionEntry) /* PGSTAT_TYPE_FUNCTION */
> +};

These probably should be const as well.


>  /*
> - * Backends store per-function info that's waiting to be sent to the 
> collector
> - * in this hash table (indexed by function OID).
> + * Stats numbers that are waiting for flushing out to shared stats are held 
> in
> + * pgStatLocalHash,
>   */
> -static HTAB *pgStatFunctions = NULL;
> +typedef struct PgStatHashEntry
> +{
> + PgStatHashEntryKey key; /* hash key */
> + dsa_pointer env;/* pointer to shared stats 
> envelope in DSA */
> +}PgStatHashEntry;
> +
> +/* struct for shared statistics entry pointed from shared hash entry. */
> +typedef struct PgStatEnvelope
> +{
> + PgStatTypes type;   /* statistics entry type */
> + Oid databaseid; /* databaseid 

Re: Logical replication from PG v13 and below to PG v14 (devel version) is not working.

2020-09-21 Thread Ashutosh Sharma
On Mon, Sep 21, 2020 at 6:58 PM Amit Kapila  wrote:
>
> On Mon, Sep 21, 2020 at 6:27 PM Ashutosh Sharma  wrote:
> >
> > Thanks Dilip for the patch. AFAIU, the fix looks good. One small comment:
> >
>
> Thanks Ashutosh and Dilip for working on this. I'll look into it in a
> day or two.
>

Just a thought:

Should we change the sequence of these #define in
src/include/replication/logicalproto.h?

#define LOGICALREP_PROTO_MIN_VERSION_NUM 1
#define LOGICALREP_PROTO_STREAM_VERSION_NUM 2
#define LOGICALREP_PROTO_VERSION_NUM 1
#define LOGICALREP_PROTO_MAX_VERSION_NUM LOGICALREP_PROTO_STREAM_VERSION_NUM

I would have changed above to something like this:

#define LOGICALREP_PROTO_VERSION_NUM 1
#define LOGICALREP_PROTO_STREAM_VERSION_NUM 2

#define LOGICALREP_PROTO_MIN_VERSION_NUM  LOGICALREP_PROTO_VERSION_NUM
#define LOGICALREP_PROTO_MAX_VERSION_NUM LOGICALREP_PROTO_STREAM_VERSION_NUM

-- 
With Regards,
Ashutosh Sharma
EnterpriseDB:http://www.enterprisedb.com




Re: Load TIME fields - proposed performance improvement

2020-09-21 Thread Peter Smith
Hi Tom.

Thanks for your feedback.

On Tue, Sep 22, 2020 at 12:44 PM Tom Lane  wrote:

> Still, for the size of the patch I'm envisioning, it'd be well
> worth the trouble.

The OP patch I gave was just a POC to test the effect and to see if
the idea was judged as worthwhile...

I will rewrite/fix it based on your suggestions.

Kind Regards,
Peter Smith.
Fujitsu Australia.




Re: Index Skip Scan (new UniqueKeys)

2020-09-21 Thread Peter Geoghegan
On Mon, Sep 21, 2020 at 5:59 PM Peter Geoghegan  wrote:
> That's all I have for now.

One more thing. I don't think that this should be a bitwise AND:

if ((offnum > maxoff) & (so->currPos.nextPage == P_NONE))
{

}
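
Presumably this was meant to be the short-circuiting logical operator, i.e.
roughly:

if (offnum > maxoff && so->currPos.nextPage == P_NONE)
{
    /* ... same handling as before ... */
}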


-- 
Peter Geoghegan




Re: Parallel Full Hash Join

2020-09-21 Thread Thomas Munro
On Tue, Sep 22, 2020 at 8:49 AM Melanie Plageman
 wrote:
> On Wed, Sep 11, 2019 at 11:23 PM Thomas Munro  wrote:
>> 1.  You could probably make it so that the PHJ_BATCH_SCAN_INNER phase
>> in this patch (the scan for unmatched tuples) is executed by only one
>> process, using the "detach-and-see-if-you-were-last" trick.  Melanie
>> proposed that for an equivalent problem in the looping hash join.  I
>> think it probably works, but it gives up a lot of parallelism and thus
>> won't scale as nicely as the attached patch.
>
> I have attached a patch which implements this
> (v1-0001-Parallel-FOJ-ROJ-single-worker-scan-buckets.patch).

Hi Melanie,

Thanks for working on this!  I have a feeling this is going to be much
easier to land than the mighty hash loop patch.  And it's good to get
one of our blocking design questions nailed down for both patches.

I took it for a very quick spin and saw simple cases working nicely,
but TPC-DS queries 51 and 97 (which contain full joins) couldn't be
convinced to use it.  Hmm.

> For starters, in order to support parallel FOJ and ROJ, I re-enabled
> setting the match bit for the tuples in the hashtable which
> 3e4818e9dd5be294d97c disabled. I did so using the code suggested in [1],
> reading the match bit to see if it is already set before setting it.

Cool.  I'm quite keen to add a "fill_inner" parameter for
ExecHashJoinImpl() and have an N-dimensional lookup table of
ExecHashJoin variants, so that this and much other related branching
can be constant-folded out of existence by the compiler in common
cases, which is why I think this is all fine, but that's for another
day...
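
(For anyone following along, the read-the-match-bit-before-setting-it idea
boils down to something like the generic sketch below. This is illustrative
C only, not the actual executor code; the names and the use of C11 atomics
are just for the standalone example.)

#include <stdatomic.h>

/* Illustrative stand-in for a shared hash table tuple header. */
typedef struct SharedTupleStub
{
    atomic_uint flags;              /* bit 0 = "has been matched" */
    /* ... tuple data would follow ... */
} SharedTupleStub;

#define STUB_MATCHED 0x01u

/*
 * Mark a tuple as matched.  Checking first means that once some worker has
 * set the bit, later probes only read the cache line instead of dirtying it
 * again, which matters for frequently matched tuples.
 */
static inline void
stub_set_matched(SharedTupleStub *tup)
{
    if ((atomic_load_explicit(&tup->flags, memory_order_relaxed)
         & STUB_MATCHED) == 0)
        atomic_fetch_or_explicit(&tup->flags, STUB_MATCHED,
                                 memory_order_relaxed);
}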

> Then, workers except for the last worker detach after exhausting the
> outer side of a batch, leaving one worker to proceed to HJ_FILL_INNER
> and do the scan of the hash table and emit unmatched inner tuples.

+1

Doing better is pretty complicated within our current execution model,
and I think this is a good compromise for now.
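
In case it helps to see the shape of the "detach-and-see-if-you-were-last"
trick in isolation: stripped of all the real PHJ machinery, it is essentially
a shared participant count where the detach that drops it to zero wins the
job. A generic sketch (not the actual barrier/batch code; names invented):

#include <stdatomic.h>
#include <stdbool.h>

/* Illustrative per-batch shared state. */
typedef struct BatchStub
{
    atomic_int  nparticipants;      /* workers currently attached */
} BatchStub;

/*
 * Detach from the batch after exhausting the outer side.  Returns true for
 * exactly one caller, the last one to detach, which then performs the
 * unmatched-inner scan on its own.
 */
static bool
batch_detach_was_last(BatchStub *batch)
{
    return atomic_fetch_sub_explicit(&batch->nparticipants, 1,
                                     memory_order_acq_rel) == 1;
}

/*
 * Usage sketch:
 *
 *     if (batch_detach_was_last(batch))
 *         scan_hash_table_for_unmatched(batch);   // hypothetical helper
 */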

Costing for uneven distribution is tricky; depending on your plan
shape, specifically whether there is something else to do afterwards
to pick up the slack, it might or might not affect the total run time
of the query.  It seems like there's not much we can do about that.

> I have also attached a variant on this patch which I am proposing to
> replace it (v1-0001-Parallel-FOJ-ROJ-single-worker-scan-chunks.patch)
> which has a new ExecParallelScanHashTableForUnmatched() in which the
> single worker doing the unmatched scan scans one HashMemoryChunk at a
> time and then frees them as it goes. I thought this might perform better
> than the version which uses the buckets because 1) it should do a bit
> less pointer chasing and 2) it frees each chunk of the hash table as it
> scans it which (maybe) would save a bit of time during
> ExecHashTableDetachBatch() when it goes through and frees the hash
> table, but, my preliminary tests showed a negligible difference between
> this and the version using buckets. I will do a bit more testing,
> though.

+1

I agree that it's the better of those two options.

>> [stuff about deadlocks]
>
> The leader exclusion tactics and the spooling idea don't solve the
> execution order deadlock possibility, so, this "all except last detach
> and last does unmatched inner scan" seems like the best way to solve
> both types of deadlock.

Agreed (at least as long as our threads of query execution are made
out of C call stacks and OS processes that block).

>> Some other notes on the patch:
>>
>> Aside from the deadlock problem, there are some minor details to tidy
>> up (handling of late starters probably not quite right, rescans not
>> yet considered).
>
> These would not be an issue with only one worker doing the scan but
> would have to be handled in a potential new parallel-enabled solution
> like I suggested above.

Makes sense.  Not sure why I thought anything special was needed for rescans.

>> There is a fun hard-coded parameter that controls
>> the parallel step size in terms of cache lines for the unmatched scan;
>> I found that 8 was a lot faster than 4, but no slower than 128 on my
>> laptop, so I set it to 8.
>
> I didn't add this cache line optimization to my chunk scanning method. I
> could do so. Do you think it is more relevant, less relevant, or the
> same if only one worker is doing the unmatched inner scan?

Yeah it's irrelevant for a single process, and even more irrelevant if
we go with your chunk-based version.

>> More thoughts along those micro-optimistic
>> lines: instead of match bit in the header, you could tag the pointer
>> and sometimes avoid having to follow it, and you could prefetch next
>> non-matching tuple's cacheline by looking a head a bit.
>
> I would be happy to try doing this once we get the rest of the patch
> ironed out so that seeing how much of a performance difference it makes
> is more straightforward.

Ignore that, I have no idea if the maintenance overhead for such an
every-tuple-in-this-chain-is-ma

Parallel INSERT (INTO ... SELECT ...)

2020-09-21 Thread Greg Nancarrow
Hi Hackers,

Following on from Dilip Kumar's POC patch for allowing parallelism of
the SELECT part of "INSERT INTO ... SELECT ...", I have attached a POC
patch for allowing parallelism of both the INSERT and SELECT parts,
where it can be allowed.
For cases where it can't be allowed (e.g. INSERT into a table with
foreign keys, or "INSERT INTO ... SELECT ... ON CONFLICT ... DO UPDATE
..."), it at least allows parallelism of the SELECT part.
Obviously I've had to update the planner and executor and
parallel-worker code to make this happen, hopefully not breaking too
many things along the way.

Examples with patch applied:


(1) non-parallel:

test=# explain analyze insert into primary_tbl select * from third_tbl;
QUERY PLAN
--
 Insert on primary_tbl  (cost=0.00..154.99 rows= width=12) (actual
time=108.445..108.446 rows=0 loops=1)
   ->  Seq Scan on third_tbl  (cost=0.00..154.99 rows= width=12)
(actual time=0.009..5.282 rows= loops=1)
 Planning Time: 0.132 ms
 Execution Time: 108.596 ms
(4 rows)


(2) parallel:

test=# explain analyze insert into primary_tbl select * from third_tbl;
   QUERY PLAN

 Gather  (cost=0.00..16.00 rows= width=12) (actual
time=69.870..70.310 rows=0 loops=1)
   Workers Planned: 5
   Workers Launched: 5
   ->  Parallel Insert on primary_tbl  (cost=0.00..16.00 rows=500
width=12) (actual time=59.948..59.949 rows=0 loops=6)
 ->  Parallel Seq Scan on third_tbl  (cost=0.00..80.00
rows=2500 width=12) (actual time=0.014..0.922 rows=1666 loops=6)
 Planning Time: 0.121 ms
 Execution Time: 70.438 ms
(7 rows)


(3) parallel select only (insert into table with foreign key)

test=# explain analyze insert into secondary_tbl select * from third_tbl;
   QUERY PLAN

 Insert on secondary_tbl  (cost=0.00..80.00 rows= width=12)
(actual time=33.864..33.926 rows=0 loops=1)
   ->  Gather  (cost=0.00..80.00 rows= width=12) (actual
time=0.451..5.201 rows= loops=1)
 Workers Planned: 4
 Workers Launched: 4
 ->  Parallel Seq Scan on third_tbl  (cost=0.00..80.00
rows=2500 width=12) (actual time=0.013..0.717 rows=2000 loops=5)
 Planning Time: 0.127 ms
 Trigger for constraint secondary_tbl_index_fkey: time=331.834 calls=
 Execution Time: 367.342 ms
(8 rows)


Known issues/TODOs:
- Currently only for "INSERT INTO ... SELECT ...". To support "INSERT
INTO ... VALUES ..." would need additional Table AM functions for
dividing up the INSERT work amongst the workers (currently only exists
for scans).
- When INSERTs are made parallel, currently the reported row-count in
the "INSERT 0 " status only reflects the rows that the
leader has processed (not the workers) - so it is obviously less than
the actual number of rows inserted.
- Functions relating to computing the number of parallel workers for
an INSERT, and the cost of an INSERT, need work.
- "force_parallel_mode" handling was updated so that it only affects
SELECT (not INSERT) - can't allow it for INSERT because we're only
supporting "INSERT INTO .. SELECT ..." and don't support other types
of INSERTs, and also can't allow attempted parallel UPDATEs resulting
from "INSERT INTO ... SELECT ... ON CONFLICT ... DO UPDATE" etc.


Thoughts and feedback?

Regards,
Greg Nancarrow
Fujitsu Australia


0001-ParallelInsertSelect.patch
Description: Binary data


Use appendStringInfoString and appendPQExpBufferStr where possible

2020-09-21 Thread Hou, Zhijie
Hi hackers

In /src/bin/scripts/reindexdb.c, /src/backend/access/rmgrdesc/dbasedesc.c and
/src/pl/plpython/plpy_elog.c

I found some more places that should use appendPQExpBufferStr (or
appendStringInfoString) instead of appendPQExpBuffer (or appendStringInfo).

Here is the patch.
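
For anyone skimming: the change is the usual pattern of switching to the
plain-string append function when the argument contains no % conversions, so
the printf-style formatter is skipped. An illustrative example (made up, not
taken from the patch):

/* Before: a constant string still goes through the format parser */
appendPQExpBuffer(&sql, " CONCURRENTLY");

/* After: appended directly, no format processing */
appendPQExpBufferStr(&sql, " CONCURRENTLY");

/*
 * The same idea applies in backend code with the StringInfo API:
 * appendStringInfo(buf, "...") becomes appendStringInfoString(buf, "...").
 */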

Previous Discussion:
https://www.postgresql.org/message-id/CAKJS1f9P=M-3ULmPvr8iCno8yvfDViHibJjpriHU8+SXUgeZ=w...@mail.gmail.com

Best regards,
Houzj/huangj




0001-appendPQExpBufferStr-instead-of-appendPQExpBuffer.patch
Description: 0001-appendPQExpBufferStr-instead-of-appendPQExpBuffer.patch


0001-Use-appendStringInfoString-instead-of-appendStringIn.patch
Description: 0001-Use-appendStringInfoString-instead-of-appendStringIn.patch


Re: Dynamic gathering the values for seq_page_cost/xxx_cost

2020-09-21 Thread Andy Fan
Thanks Ashutosh for joining the discussion :)

On Mon, Sep 21, 2020 at 9:03 PM Ashutosh Bapat 
wrote:

> On Mon, Sep 21, 2020 at 9:11 AM Andy Fan  wrote:
> >
> > Here are some changes for my detection program.
> >
> > |                                                 | seq_read_lat (us) | random_read_lat (us) |
> > | FIO                                             |                12 |                  148 |
> > | Previous main.c                                 |               8.5 |                   74 |
> > | invalidate_device_cache before each testing     |                 9 |                  150 |
> > | prepare the test data file with O_DIRECT option |                15 |                  150 |
> >
> > In invalidate_device_cache, I just create another 1GB data file and read
> > it. (see invalidate_device_cache function) this is similar as the
> previous fio setup.
> >
> > prepare test data file with O_DIRECT option means in the past, I prepare
> the test
> > file with buffer IO. and before testing, I do invalidate device cache,
> file
> > system cache. but the buffered prepared file still get better
> performance, I
> > have no idea of it. Since I don't want any cache.  I use O_DIRECT
> > option at last. The seq_read_lat changed from 9us to 15us.
> > I still can't find out the 25% difference with the FIO result. (12 us vs
> 9 us).
> >
> > At last, the random_page_cost happens to not change very much.
> >
> > /u/y/g/fdirect> sudo ./main
> > fs_cache_lat = 0.569031us, seq_read_lat = 18.901749us, random_page_lat =
> 148.650589us
> >
> > cache hit ratio: 1.00 random_page_cost 1.00
> > cache hit ratio: 0.90 random_page_cost 6.401019
> > cache hit ratio: 0.50 random_page_cost 7.663772
> > cache hit ratio: 0.10 random_page_cost 7.841498
> > cache hit ratio: 0.00 random_page_cost 7.864383
> >
> > This result looks much different from "we should use 1.1 ~ 1.5 for SSD".
> >
>
> Very interesting. Thanks for working on this. In an earlier email you
> mentioned that TPCH plans changed to efficient ones when you changed
> random_page_cost to 8.6 from 4 and seq_page_cost was set to 1. IIUC,
> setting the ratio of random_page_cost to seq_page_cost to the same ratio as
> that between the corresponding latencies improved the plans.


Yes.

> How about
> trying this with that ratio set to the one obtained from the latencies
> provided by FIO? Do we see any better plans?
>

My tool set random_page_cost to 8.6, but based on the FIO data it should be
set to 12.3 on the same hardware, and I do see better plans with 12.3 as
well.  It almost looks too smooth to be true.

The attached result_fio_mytool.tar.gz is my test result.  You can use git
show HEAD^^ to see the original plans with 8.6, git show HEAD^ to see the
plan changes after we changed random_page_cost, and git show HEAD to see the
run-time statistics changes for these queries.  I also uploaded the test
tool [1] so you can double-check.


|     |      8.6 |     12.3 |
|-----+----------+----------|
| Q2  | 2557.064 | 2444.995 |
| Q4  | 3544.606 | 3148.884 |
| Q7  | 2965.820 | 2240.185 |
| Q14 | 4988.747 | 4931.375 |



>
> page cost is one thing, but there are CPU costs also involved in costs
> of expression evaluation. Should those be changed accordingly to the
> CPU latency?
>
Yes, we need that as well.  At the beginning of this thread I treated all
of them equally. In his first reply, Tomas mentioned random_page_cost
specifically. After ~10 months, I tested TPCH on one machine and found
that random_page_cost was set quite incorrectly; after fixing it, the
results look much better.  So I'd like to work on this particular setting
first.

[1]
https://github.com/zhihuiFan/tpch-postgres/blob/master/random_page_cost.sh

-- 
Best Regards
Andy Fan


result_fio_mytool.tar.gz
Description: GNU Zip compressed data


Re: Dynamic gathering the values for seq_page_cost/xxx_cost

2020-09-21 Thread Andy Fan
>> It's probably worth testing on various other storage systems to see
>> how that applies to those.
>
> Yes, I can test more on new hardware once I get it. Now it is still in
> progress. However I can only get a physical machine with SSD or Virtual
> machine with SSD, other types are hard for me right now.
>

Here is a result on different hardware.  The test method is unchanged. [1]

Hardware Info:

Virtual Machine with 61GB memory.
Linux Kernel: 5.4.0-31-generic  Ubuntu

# lshw -short -C disk
H/W pathDevice Class  Description
=
/0/100/4/0  /dev/vda   disk   42GB Virtual I/O device
/0/100/5/0  /dev/vdb   disk   42GB Virtual I/O device

The disk on the underlying physical machine is claimed to be an SSD.

This time FIO and my tool produce essentially the same results.

fs_cache_lat = 0.957756us, seq_read_lat = 70.780327us, random_page_lat =
438.837257us

cache hit ratio: 1.00 random_page_cost 1.00
cache hit ratio: 0.90 random_page_cost 5.635470
cache hit ratio: 0.50 random_page_cost 6.130565
cache hit ratio: 0.10 random_page_cost 6.192183
cache hit ratio: 0.00 random_page_cost 6.199989

| | seq_read_lat(us) | random_read_lat(us) |
| FIO |   70 | 437 |
| MY Tool |   70 | 438 |
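
For reference, the cache-hit-ratio table above is consistent with weighting
the cached latency against the device latency for both access patterns and
taking the ratio of the two. A sketch of that calculation (my reconstruction
from the printed numbers, not necessarily the tool's exact code):

/* random_page_cost estimate for a given cache hit ratio, latencies in us */
static double
estimate_random_page_cost(double hit_ratio,
                          double fs_cache_lat,
                          double seq_read_lat,
                          double random_page_lat)
{
    double random_cost = hit_ratio * fs_cache_lat +
                         (1.0 - hit_ratio) * random_page_lat;
    double seq_cost = hit_ratio * fs_cache_lat +
                      (1.0 - hit_ratio) * seq_read_lat;

    return random_cost / seq_cost;
}

/*
 * e.g. estimate_random_page_cost(0.90, 0.957756, 70.780327, 438.837257)
 * returns ~5.64, matching the 0.90 line above.
 */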

The following query plans changed when random_page_cost was raised from 4
to 6.2; the execution times changed as well.

|     | random_page_cost=4 | random_page_cost=6.2 |
|-----+--------------------+----------------------|
| Q1  |               2561 |             2528.272 |
| Q10 |           4675.749 |             4684.225 |
| Q13 |          18858.048 |            18565.929 |
| Q2  |            329.279 |              308.723 |
| Q5  |          46248.132 |             7900.173 |
| Q6  |          52526.462 |            47639.503 |
| Q7  |          27348.900 |            25829.221 |

Q5 improved by 5.8 times and Q6 & Q7 improved by ~10%.

[1]
https://www.postgresql.org/message-id/CAKU4AWpRv50k8E3tC3tiLWGe2DbKaoZricRh_YJ8y_zK%2BHdSjQ%40mail.gmail.com

-- 
Best Regards
Andy Fan


Re: Logical replication from PG v13 and below to PG v14 (devel version) is not working.

2020-09-21 Thread Amit Kapila
On Tue, Sep 22, 2020 at 8:34 AM Ashutosh Sharma  wrote:
>
> On Mon, Sep 21, 2020 at 6:58 PM Amit Kapila  wrote:
> >
> > On Mon, Sep 21, 2020 at 6:27 PM Ashutosh Sharma  
> > wrote:
> > >
> > > Thanks Dilip for the patch. AFAIU, the fix looks good. One small comment:
> > >
> >
> > Thanks Ashutosh and Dilip for working on this. I'll look into it in a
> > day or two.
> >
>
> Just a thought:
>
> Should we change the sequence of these #define in
> src/include/replication/logicalproto.h?
>
> #define LOGICALREP_PROTO_MIN_VERSION_NUM 1
> #define LOGICALREP_PROTO_STREAM_VERSION_NUM 2
> #define LOGICALREP_PROTO_VERSION_NUM 1
> #define LOGICALREP_PROTO_MAX_VERSION_NUM LOGICALREP_PROTO_STREAM_VERSION_NUM
>
> I would have changed above to something like this:
>
> #define LOGICALREP_PROTO_VERSION_NUM 1
> #define LOGICALREP_PROTO_STREAM_VERSION_NUM 2
>
> #define LOGICALREP_PROTO_MIN_VERSION_NUM  LOGICALREP_PROTO_VERSION_NUM
> #define LOGICALREP_PROTO_MAX_VERSION_NUM LOGICALREP_PROTO_STREAM_VERSION_NUM
>

I am not sure if this suggestion makes it better than what is proposed
by Dilip but I think we can declare them in define number order like
below:
#define LOGICALREP_PROTO_MIN_VERSION_NUM 1
#define LOGICALREP_PROTO_VERSION_NUM 1
#define LOGICALREP_PROTO_STREAM_VERSION_NUM 2
#define LOGICALREP_PROTO_MAX_VERSION_NUM LOGICALREP_PROTO_STREAM_VERSION_NUM

Another thing is can we also test by having a publisher of v14 and
subscriber of v13 or prior version, just reverse of what Ashutosh has
tested?

-- 
With Regards,
Amit Kapila.




Re: Logical replication from PG v13 and below to PG v14 (devel version) is not working.

2020-09-21 Thread Ashutosh Sharma
On Tue, Sep 22, 2020 at 12:02 PM Amit Kapila  wrote:
>
> On Tue, Sep 22, 2020 at 8:34 AM Ashutosh Sharma  wrote:
> >
> > On Mon, Sep 21, 2020 at 6:58 PM Amit Kapila  wrote:
> > >
> > > On Mon, Sep 21, 2020 at 6:27 PM Ashutosh Sharma  
> > > wrote:
> > > >
> > > > Thanks Dilip for the patch. AFAIU, the fix looks good. One small 
> > > > comment:
> > > >
> > >
> > > Thanks Ashutosh and Dilip for working on this. I'll look into it in a
> > > day or two.
> > >
> >
> > Just a thought:
> >
> > Should we change the sequence of these #define in
> > src/include/replication/logicalproto.h?
> >
> > #define LOGICALREP_PROTO_MIN_VERSION_NUM 1
> > #define LOGICALREP_PROTO_STREAM_VERSION_NUM 2
> > #define LOGICALREP_PROTO_VERSION_NUM 1
> > #define LOGICALREP_PROTO_MAX_VERSION_NUM LOGICALREP_PROTO_STREAM_VERSION_NUM
> >
> > I would have changed above to something like this:
> >
> > #define LOGICALREP_PROTO_VERSION_NUM 1
> > #define LOGICALREP_PROTO_STREAM_VERSION_NUM 2
> >
> > #define LOGICALREP_PROTO_MIN_VERSION_NUM  LOGICALREP_PROTO_VERSION_NUM
> > #define LOGICALREP_PROTO_MAX_VERSION_NUM LOGICALREP_PROTO_STREAM_VERSION_NUM
> >
>
> I am not sure if this suggestion makes it better than what is proposed
> by Dilip but I think we can declare them in define number order like
> below:
> #define LOGICALREP_PROTO_MIN_VERSION_NUM 1
> #define LOGICALREP_PROTO_VERSION_NUM 1
> #define LOGICALREP_PROTO_STREAM_VERSION_NUM 2
> #define LOGICALREP_PROTO_MAX_VERSION_NUM LOGICALREP_PROTO_STREAM_VERSION_NUM
>

The only reason I proposed that was that for *_MAX_VERSION_NUM we are
already using the latest PROTOCOL version name in its definition, so why
not do the same when defining *_MIN_VERSION_NUM as well. Other than
that, I also wanted to correct the sequence so that they are defined
in increasing order, which you have already done here.

> Another thing is can we also test by having a publisher of v14 and
> subscriber of v13 or prior version, just reverse of what Ashutosh has
> tested?
>

I've tested LR from PGv12 (Publisher) to PGv14 (Subscriber) and it works fine.

-- 
With Regards,
Ashutosh Sharma
EnterpriseDB:http://www.enterprisedb.com




Re: Logical replication from PG v13 and below to PG v14 (devel version) is not working.

2020-09-21 Thread Ashutosh Sharma
On Tue, Sep 22, 2020 at 12:22 PM Ashutosh Sharma  wrote:
>
> On Tue, Sep 22, 2020 at 12:02 PM Amit Kapila  wrote:
> >
> > On Tue, Sep 22, 2020 at 8:34 AM Ashutosh Sharma  
> > wrote:
> > >
> > > On Mon, Sep 21, 2020 at 6:58 PM Amit Kapila  
> > > wrote:
> > > >
> > > > On Mon, Sep 21, 2020 at 6:27 PM Ashutosh Sharma  
> > > > wrote:
> > > > >
> > > > > Thanks Dilip for the patch. AFAIU, the fix looks good. One small 
> > > > > comment:
> > > > >
> > > >
> > > > Thanks Ashutosh and Dilip for working on this. I'll look into it in a
> > > > day or two.
> > > >
> > >
> > > Just a thought:
> > >
> > > Should we change the sequence of these #define in
> > > src/include/replication/logicalproto.h?
> > >
> > > #define LOGICALREP_PROTO_MIN_VERSION_NUM 1
> > > #define LOGICALREP_PROTO_STREAM_VERSION_NUM 2
> > > #define LOGICALREP_PROTO_VERSION_NUM 1
> > > #define LOGICALREP_PROTO_MAX_VERSION_NUM 
> > > LOGICALREP_PROTO_STREAM_VERSION_NUM
> > >
> > > I would have changed above to something like this:
> > >
> > > #define LOGICALREP_PROTO_VERSION_NUM 1
> > > #define LOGICALREP_PROTO_STREAM_VERSION_NUM 2
> > >
> > > #define LOGICALREP_PROTO_MIN_VERSION_NUM  LOGICALREP_PROTO_VERSION_NUM
> > > #define LOGICALREP_PROTO_MAX_VERSION_NUM 
> > > LOGICALREP_PROTO_STREAM_VERSION_NUM
> > >
> >
> > I am not sure if this suggestion makes it better than what is proposed
> > by Dilip but I think we can declare them in define number order like
> > below:
> > #define LOGICALREP_PROTO_MIN_VERSION_NUM 1
> > #define LOGICALREP_PROTO_VERSION_NUM 1
> > #define LOGICALREP_PROTO_STREAM_VERSION_NUM 2
> > #define LOGICALREP_PROTO_MAX_VERSION_NUM LOGICALREP_PROTO_STREAM_VERSION_NUM
> >
>
> The only reason I proposed that was because for the *_MAX_VERSION_NUM
> we are using the latest PROTOCOL version name in its definition so why
> not to do the same for defining *_MIN_VERSION_NUM as well. Other than
> that, I also wanted to correct the sequence so that they are defined
> in the increasing order which you have already done here.
>
> > Another thing is can we also test by having a publisher of v14 and
> > subscriber of v13 or prior version, just reverse of what Ashutosh has
> > tested?
> >
>
> I've tested LR from PGv12 (Publisher) to PGv14 (Subscriber) and it works fine.
>

I meant LR from PGv14 (Publisher) to PGv12 (Subscriber) not the other way.

> --
> With Regards,
> Ashutosh Sharma
> EnterpriseDB:http://www.enterprisedb.com



