Re: [HACKERS] logical decoding of two-phase transactions
> Why can't we call ReorderBufferCleanupTXN() from
> ReorderBufferFinishPrepared after your changes?

Since the truncate already removed the changes, it would fail on the below Assert in ReorderBufferCleanupTXN():

    /* Check we're not mixing changes from different transactions. */
    Assert(change->txn == txn);

regards,
Ajin Cherian
Fujitsu Australia
Re: VACUUM PARALLEL option vs. max_parallel_maintenance_workers
On 2020-09-21 05:48, Amit Kapila wrote:
> What according to you should be the behavior here and how will it be
> better than current?

I think if I write VACUUM (PARALLEL 5), it should use up to 5 workers (up to the number of indexes), even if max_parallel_maintenance_workers is 2.

--
Peter Eisentraut              http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Re: Yet another fast GiST build
On 21/09/2020 02:06, Tom Lane wrote:
> Justin Pryzby writes:
>> This also appears to break checksums.

Thanks, I'll go fix it.

> I was wondering about that, because the typical pattern for use of
> smgrextend for indexes seems to be
>
>     RelationOpenSmgr(rel);
>     PageSetChecksumInplace(page, lastblock);
>     smgrextend(rel->rd_smgr, MAIN_FORKNUM, lastblock, zerobuf.data, false);
>
> and gist_indexsortbuild wasn't doing either of the first two things.
> gist_indexsortbuild_flush_ready_pages looks like it might be a few
> bricks shy of a load too. But my local CLOBBER_CACHE_ALWAYS run hasn't
> gotten to anything except the pretty-trivial index made in point.sql,
> so I don't have evidence about it.

I don't think a relcache invalidation can happen on the index we're building. Other similar callers call RelationOpenSmgr(rel) before every write though (e.g. _bt_blwritepage()), so perhaps it's better to copy that pattern here too.

> Another interesting point is that all the other index AMs seem to WAL-log
> the new page before the smgrextend call, whereas this code is doing it
> in the other order. I strongly doubt that both patterns are equally
> correct. Could be that the other AMs are in the wrong though.

My thinking was that it's better to call smgrextend() first, so that if you run out of disk space, you get the error before WAL-logging it. That reduces the chance that WAL replay will run out of disk space. A lot of things are different during WAL replay, so it's quite likely that WAL replay runs out of disk space anyway if you're living on the edge, but still.

I didn't notice that the other callers are doing it the other way round, though. I think they need to, so that they can stamp the page with the LSN of the WAL record. But GiST build is special in that regard, because it stamps all pages with GistBuildLSN.

- Heikki
Re: Yet another fast GiST build
On 21/09/2020 11:08, Heikki Linnakangas wrote:
> I think they need to, so that they can stamp the page with the LSN of
> the WAL record. But GiST build is special in that regard, because it
> stamps all pages with GistBuildLSN.

Actually, don't we have a problem with that, even before this patch? Even though we set the LSN to the magic GistBuildLSN value when we build the index, WAL replay will write the LSN of the record instead. That would mess with the LSN-NSN interlock. After WAL replay (or in a streaming replica), a scan on the GiST index might traverse right-links unnecessarily.

- Heikki
Re: Binaries on s390x arch
Re: Namrata Bhave
> As seen from downloads page, the Apt repo/rpms are not yet available for
> s390x for latest versions.

Hi,

are you asking about apt (.deb) or yum (.rpm) packages?

> Wanted to know if there is any work going on/planned to provide Postgres in
> ready-to-use package or installer form for s390x architecture?

s390x is fully supported as a Debian release architecture, so you can get PostgreSQL and all packaged modules there (PG11 on Debian buster).

The apt team doesn't have plans for s390x yet on the PostgreSQL side. If you are interested, are you in a position to donate a build host?

Christoph
Re: Yet another fast GiST build
> On 21 Sep 2020, at 13:45, Heikki Linnakangas wrote:
>
> On 21/09/2020 11:08, Heikki Linnakangas wrote:
>> I think they need to, so that they can stamp the page with the LSN of
>> the WAL record. But GiST build is special in that regard, because it
>> stamps all pages with GistBuildLSN.
>
> Actually, don't we have a problem with that, even before this patch? Even
> though we set the LSN to the magic GistBuildLSN value when we build the
> index, WAL replay will write the LSN of the record instead. That would mess
> with the LSN-NSN interlock. After WAL replay (or in a streaming replica), a
> scan on the GiST index might traverse right-links unnecessarily.

I think we don't set rightlinks during index build.

Best regards, Andrey Borodin.
Re: Is deduplicate_items ever used in nbtree?
Hi,

On Mon, Sep 21, 2020 at 2:21 PM Nikolay Shaplov wrote:
>
> Hi!
>
> I am still working with reloptions patches, and noticed that in current master
> the deduplicate_items option exists, but is never actually used.
> You can set it, you can dump it, it has a proper representation in BTOptions,
> it is mentioned in the documentation, but it is never actually used in the code.
> Or at least I did not manage to grep it.
>
> Is it the way it should be, or does something need fixing?

It's used via the BTGetDeduplicateItems() macro.
Logical replication from PG v13 and below to PG v14 (devel version) is not working.
Hi All,

Today, while exploring logical replication in PostgreSQL, I noticed that logical replication from PG version 13 and below to PG v14 (development version) is not working. It has stopped working from the following git commit onwards:

commit 464824323e57dc4b397e8b05854d779908b55304
Author: Amit Kapila
Date:   Thu Sep 3 07:54:07 2020 +0530

    Add support for streaming to built-in logical replication.
    ...
    ...

Here is the experiment that I performed to verify this:

Publisher (PG-v12/13):
==
CREATE TABLE pg2pg (id serial PRIMARY KEY, whatever text);

INSERT INTO pg2pg (whatever) SELECT 'str-' || i FROM generate_series(1, 5) i;

SELECT * FROM pg2pg;

CREATE PUBLICATION pg2pg_pub FOR TABLE pg2pg;

Subscriber (PG-v14 HEAD or commit 46482432):
=
CREATE TABLE pg2pg (id serial PRIMARY KEY, whatever text);

CREATE SUBSCRIPTION pg2pg_sub CONNECTION 'host=127.0.0.1 port=5433 dbname=postgres user=ashu' PUBLICATION pg2pg_pub;

SELECT * FROM pg2pg;

The above SELECT query produces no result. When this experiment is performed on a commit before the one mentioned above, it works fine.

After spending some time looking into this issue, I observed that the above git commit has bumped the logical replication protocol version. Due to this, the logical replication apply worker process is unable to start WAL streaming, which causes it to terminate. Therefore, the data inserted into the publication table is not applied on the subscription table (i.e. no data replication happens).

--
With Regards,
Ashutosh Sharma
EnterpriseDB: http://www.enterprisedb.com
Re: [HACKERS] logical decoding of two-phase transactions
On Mon, Sep 21, 2020 at 12:36 PM Ajin Cherian wrote:
>
> > Why can't we call ReorderBufferCleanupTXN() from
> > ReorderBufferFinishPrepared after your changes?
>
> Since the truncate already removed the changes, it would fail on the
> below Assert in ReorderBufferCleanupTXN()
>
> /* Check we're not mixing changes from different transactions. */
> Assert(change->txn == txn);
>

The changes list should be empty by that time because we remove each change from the list; see the "dlist_delete(&change->node);" call in ReorderBufferTruncateTXN. If you are hitting the Assert as you mentioned, then I think the problem is something else.

--
With Regards,
Amit Kapila.
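For context, a minimal sketch of the kind of list walk being referred to (illustrative only, not the exact reorderbuffer.c source; freeing and memory accounting are omitted):

    dlist_mutable_iter iter;

    dlist_foreach_modify(iter, &txn->changes)
    {
        ReorderBufferChange *change;

        change = dlist_container(ReorderBufferChange, node, iter.cur);

        /* unlink the entry; after the loop txn->changes is empty */
        dlist_delete(&change->node);
    }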
Re: [HACKERS] logical decoding of two-phase transactions
On Mon, Sep 21, 2020 at 10:20 AM Amit Kapila wrote: > > On Sun, Sep 20, 2020 at 11:01 AM Dilip Kumar wrote: > > > > On Fri, Sep 18, 2020 at 6:02 PM Ajin Cherian wrote: > > > > > > > 3. > > > > + /* > > + * If it's ROLLBACK PREPARED then handle it via callbacks. > > + */ > > + if (TransactionIdIsValid(xid) && > > + !SnapBuildXactNeedsSkip(ctx->snapshot_builder, buf->origptr) && > > + parsed->dbId == ctx->slot->data.database && > > + !FilterByOrigin(ctx, origin_id) && > > + ReorderBufferTxnIsPrepared(ctx->reorder, xid, parsed->twophase_gid)) > > + { > > + ReorderBufferFinishPrepared(ctx->reorder, xid, buf->origptr, buf->endptr, > > + commit_time, origin_id, origin_lsn, > > + parsed->twophase_gid, false); > > + return; > > + } > > > > > > I think we have already checked !SnapBuildXactNeedsSkip, parsed->dbId > > == ctx->slot->data.database and !FilterByOrigin in DecodePrepare > > so if those are not true then we wouldn't have prepared this > > transaction i.e. ReorderBufferTxnIsPrepared will be false so why do we > > need > > to recheck this conditions. > > > > Yeah, probably we should have Assert for below three conditions: > + !SnapBuildXactNeedsSkip(ctx->snapshot_builder, buf->origptr) && > + parsed->dbId == ctx->slot->data.database && > + !FilterByOrigin(ctx, origin_id) && +1 -- Regards, Dilip Kumar EnterpriseDB: http://www.enterprisedb.com
Re: [HACKERS] logical decoding of two-phase transactions
On Sun, Sep 20, 2020 at 3:31 PM Dilip Kumar wrote:
> + /*
> + * If it's ROLLBACK PREPARED then handle it via callbacks.
> + */
> + if (TransactionIdIsValid(xid) &&
> + !SnapBuildXactNeedsSkip(ctx->snapshot_builder, buf->origptr) &&
> + parsed->dbId == ctx->slot->data.database &&
> + !FilterByOrigin(ctx, origin_id) &&
> + ReorderBufferTxnIsPrepared(ctx->reorder, xid, parsed->twophase_gid))
> + {
> + ReorderBufferFinishPrepared(ctx->reorder, xid, buf->origptr, buf->endptr,
> + commit_time, origin_id, origin_lsn,
> + parsed->twophase_gid, false);
> + return;
> + }
>
> I think we have already checked !SnapBuildXactNeedsSkip, parsed->dbId
> == ctx->slot->data.database and !FilterByOrigin in DecodePrepare,
> so if those are not true then we wouldn't have prepared this
> transaction, i.e. ReorderBufferTxnIsPrepared will be false, so why do we
> need to recheck these conditions.

We could enter DecodeAbort even without a prepare, as the code is common for both XLOG_XACT_ABORT and XLOG_XACT_ABORT_PREPARED. So, the conditions !SnapBuildXactNeedsSkip, parsed->dbId == ctx->slot->data.database and !FilterByOrigin could be true but the transaction is not prepared; then we don't need to do a ReorderBufferFinishPrepared (with commit flag false) but call ReorderBufferAbort. But I think there is a problem: if those conditions are in fact false, then we should return without trying to abort using ReorderBufferAbort. What do you think?

I agree with all your other comments.

regards,
Ajin
Fujitsu Australia
Re: VACUUM PARALLEL option vs. max_parallel_maintenance_workers
On Mon, Sep 21, 2020 at 12:45 PM Peter Eisentraut wrote:
>
> On 2020-09-21 05:48, Amit Kapila wrote:
> > What according to you should be the behavior here and how will it be
> > better than current?
>
> I think if I write VACUUM (PARALLEL 5), it should use up to 5 workers
> (up to the number of indexes), even if max_parallel_maintenance_workers
> is 2.
>

So you want it to disregard max_parallel_maintenance_workers, but all parallel operations have to respect one of the max_parallel_* options so that the system can enforce max_parallel_workers, beyond which it won't allow more parallel workers. If we don't respect one of the max_parallel_* options, we will unnecessarily try to register that many workers even though we won't be able to start them all. I think it is better to keep separate worker limits for scans and for maintenance operations, so that the user can run different kinds of parallel operations in the system at the same time.

--
With Regards,
Amit Kapila.
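As an illustration of that argument, a simplified sketch of how the effective worker count ends up being capped (names such as nrequested and nindexes_ok are placeholders here, not the actual vacuumlazy.c variables):

    static int
    compute_parallel_workers_sketch(int nrequested, int nindexes_ok)
    {
        /* never more workers than indexes that can usefully be processed */
        int nworkers = Min(nrequested, nindexes_ok);

        /* per-operation cap: PARALLEL 5 is still clamped to this GUC */
        nworkers = Min(nworkers, max_parallel_maintenance_workers);

        /*
         * max_parallel_workers is enforced later, when the workers are
         * actually registered and launched.
         */
        return nworkers;
    }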
Re: Logical replication from PG v13 and below to PG v14 (devel version) is not working.
On Mon, Sep 21, 2020 at 3:26 PM Ashutosh Sharma wrote:
>
> commit 464824323e57dc4b397e8b05854d779908b55304
> Author: Amit Kapila
> Date:   Thu Sep 3 07:54:07 2020 +0530
>
> Above select query produces no result. When this experiment is
> performed below the mentioned git commit, it works fine.
>

I encountered the same issue. We get to see the below error during the table sync phase in the subscriber server logs.

2020-09-21 16:01:45.794 IST [782428] LOG:  logical replication apply worker for subscription "pg2pg_sub" has started
2020-09-21 16:01:45.799 IST [782428] ERROR:  could not start WAL streaming: ERROR:  client sent proto_version=2 but we only support protocol 1 or lower
        CONTEXT:  slot "pg2pg_sub", output plugin "pgoutput", in the startup callback
2020-09-21 16:01:45.800 IST [781607] LOG:  background worker "logical replication worker" (PID 782428) exited with exit code 1

>
> After spending some time looking into this issue, I observed that
> above git commit has bumped the logical replication protocol version.
> Due to this, the logical replication apply worker process is unable to
> do WAL streaming which causes it to terminate. Therefore, the data
> inserted in the publication table is not applied on the subscription
> table (i.e. no data replication happens)
>

That's because the above commit changed LOGICALREP_PROTO_VERSION_NUM to 2 from 1.

With Regards,
Bharath Rupireddy.
EnterpriseDB: http://www.enterprisedb.com
Re: Logical replication from PG v13 and below to PG v14 (devel version) is not working.
On Mon, Sep 21, 2020 at 3:26 PM Ashutosh Sharma wrote: > > Hi All, > > Today, while exploring logical replication in PostgreSQL, I noticed > that logical replication from PG version 13 and below to PG v14 > (development version) is not working. It has stopped working from the > following git commit onwards: > > commit 464824323e57dc4b397e8b05854d779908b55304 > Author: Amit Kapila > Date: Thu Sep 3 07:54:07 2020 +0530 > > Add support for streaming to built-in logical replication. > > ... > ... > > Here is the experiment that I performed to verify this: > > Publisher (PG-v12/13): > == > CREATE TABLE pg2pg (id serial PRIMARY KEY, whatever text); > > INSERT INTO pg2pg (whatever) SELECT 'str-' || i FROM generate_series(1, 5) i; > > SELECT * FROM pg2pg; > > CREATE PUBLICATION pg2pg_pub FOR TABLE pg2pg; > > Subscriber (PG-v14 HEAD or commit 46482432): > = > CREATE TABLE pg2pg (id serial PRIMARY KEY, whatever text); > > CREATE SUBSCRIPTION pg2pg_sub CONNECTION 'host=127.0.0.1 port=5433 > dbname=postgres user=ashu' PUBLICATION pg2pg_pub; > > SELECT * FROM pg2pg; > > Above select query produces no result. When this experiment is > performed below the mentioned git commit, it works fine. > > After spending some time looking into this issue, I observed that > above git commit has bumped the logical replication protocol version. > Due to this, the logical replication apply worker process is unable to > do WAL streaming which causes it to terminate. Therefore, the data > inserted in the publication table is not applied on the subscription > table (i.e. no data replication happens) Seems like this commit, should have only set the LOGICALREP_PROTO_STREAM_VERSION_NUM to 2 but the LOGICALREP_PROTO_VERSION_NUM shouln't have been changed. -- Regards, Dilip Kumar EnterpriseDB: http://www.enterprisedb.com
Re: [POC] Fast COPY FROM command for the table with foreign partitions
This patch currently looks very ready for use. And I'm taking a close look at the error reporting. Here we have a difference in behavior between a local and a foreign table.

Regression test in postgres_fdw.sql:

copy rem2 from stdin;
-1	xyzzy
\.

reports error (1):
=
ERROR:  new row for relation "loc2" violates check constraint...
DETAIL:  Failing row contains (-1, xyzzy).
CONTEXT:  COPY loc2, line 1: "-1	xyzzy"
remote SQL command: COPY public.loc2(f1, f2) FROM STDIN
COPY rem2, line 2

But local COPY into loc2 reports another error (2):
===
copy loc2 from stdin;
ERROR:  new row for relation "loc2" violates check constraint...
DETAIL:  Failing row contains (-1, xyzzy).
CONTEXT:  COPY loc2, line 1: "-1	xyzzy"

Report (2) is shorter and more specific. Report (1) contains meaningless information. Maybe we need to improve the error report? For example like this:

ERROR:  Failed COPY into foreign table "rem2": new row for relation "loc2" violates check constraint...
DETAIL:  Failing row contains (-1, xyzzy).
remote SQL command: COPY public.loc2(f1, f2) FROM STDIN
COPY rem2, line 1

The problem here is that we run into the error after the COPY FROM command completes, and we need to translate the line number from the foreign server to the line number of the overall COPY command.

--
regards,
Andrey Lepikhov
Postgres Professional
Re: [HACKERS] logical decoding of two-phase transactions
On Mon, Sep 21, 2020 at 3:45 PM Ajin Cherian wrote: > > On Sun, Sep 20, 2020 at 3:31 PM Dilip Kumar wrote: > > > + /* > > + * If it's ROLLBACK PREPARED then handle it via callbacks. > > + */ > > + if (TransactionIdIsValid(xid) && > > + !SnapBuildXactNeedsSkip(ctx->snapshot_builder, buf->origptr) && > > + parsed->dbId == ctx->slot->data.database && > > + !FilterByOrigin(ctx, origin_id) && > > + ReorderBufferTxnIsPrepared(ctx->reorder, xid, parsed->twophase_gid)) > > + { > > + ReorderBufferFinishPrepared(ctx->reorder, xid, buf->origptr, buf->endptr, > > + commit_time, origin_id, origin_lsn, > > + parsed->twophase_gid, false); > > + return; > > + } > > > > > > I think we have already checked !SnapBuildXactNeedsSkip, parsed->dbId > > == ctx->slot->data.database and !FilterByOrigin in DecodePrepare > > so if those are not true then we wouldn't have prepared this > > transaction i.e. ReorderBufferTxnIsPrepared will be false so why do we > > need > > to recheck this conditions. > > We could enter DecodeAbort even without a prepare, as the code is > common for both XLOG_XACT_ABORT and XLOG_XACT_ABORT_PREPARED. So, the > conditions !SnapBuildXactNeedsSkip, parsed->dbId > > == ctx->slot->data.database and !FilterByOrigin could be true but the > > transaction is not prepared, then we dont need to do a > > ReorderBufferFinishPrepared (with commit flag false) but called > > ReorderBufferAbort. But I think there is a problem, if those conditions are > > in fact false, then we should return without trying to Abort using > > ReorderBufferAbort, what do you think? > I think we need to call ReorderBufferAbort at least to clean up the TXN. Also, if what you are saying is correct then that should be true without this patch as well, no? If so, we don't need to worry about it as far as this patch is concerned. -- With Regards, Amit Kapila.
Re: pg_service.conf file with unknown parameters
> On 11 Sep 2020, at 14:39, Magnus Hagander wrote:
> For example, if I have a service file with gssencmode=disable set, that
> service file cannot be used by a psql client linked against libpq from
> version 10. Even if the behavior would be identical (since v10 doesn't have
> gssencmode).
>
> Is there a particular reason we (1) refuse unknown parameters in the file,

The above case is a good example of when silently ignoring would be beneficial. We would however run the risk that someone has this in their service file, which is silently ignored, and fails to notice:

    ssl_mim_protocol_version=TLSv1.3

Not sure if that's worth the risk? Halting on unknown parameters is also consistent with postgresql.conf parsing etc (which isn't a clientside file I know, but still relevant IMO).

> and (2) call it a "syntax error"?

That I agree with isn't entirely correct; the syntax is correct but the parameter is unknown. Something along the following seems more correct:

-             libpq_gettext("syntax error in service file \"%s\", line %d\n"),
+             libpq_gettext("unknown parameter in service file \"%s\", line %d\n"),

> The documentation just says it's "INI format" file -- though in my experience
> most other INI file parsers just ignore extra parameters included..

I don't think the INI file format is formally defined anywhere, but I do believe it's considered to be strictly key-values (Wikipedia [0] agrees with that). Since we allow ldap configurations like the one below, it's technically not INI format though:

[service=mydb]
ldap://127.0.0.1:1/foo?bar?lorem?ipsum

That might be borderline hairsplitting, but perhaps calling it INI format in the docs isn't really helping? Maybe we should reword that to say key/value or something similar?

And this brings up an even more interesting case: the above will yield a syntax error if libpq wasn't compiled with LDAP support, as opposed to other parameters (like SSL* etc) which are ignored for builds not supporting them. Is there a reason to treat ldap differently?

cheers ./daniel

[0] https://en.wikipedia.org/wiki/INI_file
Re: Planner, check if can use consider HASH for groupings (src/backend/optimizer/plan/planner.c)
Em dom., 20 de set. de 2020 às 21:10, Tomas Vondra < tomas.von...@2ndquadrant.com> escreveu: > On Sun, Sep 20, 2020 at 08:09:56PM -0300, Ranier Vilela wrote: > >Em sex., 18 de set. de 2020 às 10:37, Tomas Vondra < > >tomas.von...@2ndquadrant.com> escreveu: > > > >> On Thu, Sep 17, 2020 at 06:31:12PM -0300, Ranier Vilela wrote: > >> >Em qui., 17 de set. de 2020 às 17:09, Tomas Vondra < > >> >tomas.von...@2ndquadrant.com> escreveu: > >> > > >> >> On Thu, Sep 17, 2020 at 02:12:12PM -0300, Ranier Vilela wrote: > >> >> >Hi, > >> >> > > >> >> >In case gd->any_hashable is FALSE, grouping_is_hashable is never > >> called. > >> >> >In this case, the planner could use HASH for groupings, but will > never > >> >> know. > >> >> > > >> >> > >> >> The condition is pretty simple - if the query has grouping sets, > look at > >> >> grouping sets, otherwise look at groupClause. For this to be an issue > >> >> the query would need to have both grouping sets and (independent) > group > >> >> clause - which AFAIK is not possible. > >> >> > >> >Uh? > >> >(parse->groupClause != NIL) If different from NIL we have > ((independent) > >> >group clause), grouping_is_hashable should check? > >> >(gd ? gd->any_hashable : grouping_is_hashable(parse->groupClause > >> >If gd is not NIL and gd->any_hashtable is FALSE? > >> > > >> > >> Sorry, I'm not quite sure I understand what this is meant to say :-( > >> > >> Anyway, (groupClause != NIL) does not mean the groupClause is somehow > >> independent (of what?). Add some debugging to create_grouping_paths and > >> you'll see that e.g. this query ends up with groupClause with 3 items: > >> > >> select 1 from t group by grouping sets ((a), (b), (c)); > >> > >> and so does this one: > >> > >> select 1 from t group by grouping sets ((a,c), (a,b)); > >> > >> I'm no expert in grouping sets, but I'd bet this means we transform the > >> grouping sets into a groupClause by extracting the keys. I haven't > >> investigated why exactly we do this, but I'm sure there's a reason (e.g. > >> it gives us SortGroupClause). > >> > >> You seem to believe a query can have both grouping sets and regular > >> grouping at the same level - but how would such query look like? Surely > >> you can't have two GROuP BY clauses. You can do > >> > >Translating into code: > >gd is grouping sets and > >parse->groupClause is regular grouping > >and we cannot have both at the same time. > > > > Have you verified the claim that we can't have both at the same time? As > I already explained, we build groupClause from grouping sets. I suggest > you do some debugging using the queries I used as examples, and you'll > see the claim is not true. > I think we already agreed that we cannot have both at the same time. > > > > >> select 1 from t group by a, grouping sets ((b), (c)); > >> > >> which is however mostly equivalent to (AFAICS) > >> > >> select 1 from t group by grouping sets ((a,b), (a,c)) > >> > >> so it's not like we have an independent groupClause either I think. > >> > >> The whole point of the original condition is - if there are grouping > >> sets, check if at least one can be executed using hashing (i.e. all keys > >> are hashable). Otherwise (without grouping sets) look at the grouping as > >> a whole. > >> > >Or if gd is NULL check parse->groupClause. > > > > Which is what I'm saying. > > > > >> I don't see how your change improves this - if there are grouping sets, > >> it's futile to look at the whole groupClause if at least one grouping > >> set can't be hashed. 
> >> > >> But there's a simple way to disprove this - show us a case (table and a > >> query) where your code correctly allows hashing while the current one > >> does not. > > > >I'm not an expert in grouping either. > >The question I have here is whether gd is populated and has gd-> > >any_hashable as FALSE, > >Its mean no use checking parse-> groupClause, it's a waste of time, Ok. > > > > I'm sorry, I don't follow your logic. Those are two separate cases. If > we have grouping sets, we have to check if at least one can be hashed. > If we don't have grouping sets, we have to check groupClause directly. > Why would that be a waste of time is unclear to me. > This is clear to me. The problem is how it was implemented in create_grouping_paths. > > > > >> > >> > > >> >> For hashing to be worth considering, at least one grouping set has > to be > >> >> hashable - otherwise it's pointless. > >> >> > >> >> Granted, you could have something like > >> >> > >> >> GROUP BY GROUPING SETS ((a), (b)), c > >> >> > >> >> which I think essentially says "add c to every grouping set" and that > >> >> will be covered by the any_hashable check. > >> >> > >> >I am not going into the merit of whether or not it is worth hashing. > >> >grouping_is_hashable as a last resort, must decide. > >> > > >> > >> I don't know what this is supposed to say either. The whole point of > >> this check is to simply skip construction of hash-based paths in cases
Re: [HACKERS] logical decoding of two-phase transactions
On Mon, Sep 21, 2020 at 9:24 PM Amit Kapila wrote:

> I think we need to call ReorderBufferAbort at least to clean up the
> TXN. Also, if what you are saying is correct then that should be true
> without this patch as well, no? If so, we don't need to worry about it
> as far as this patch is concerned.

Yes, that is true. So I will change this check to:

    if (TransactionIdIsValid(xid) &&
        ReorderBufferTxnIsPrepared(ctx->reorder, xid, parsed->twophase_gid))

regards,
Ajin Cherian
Fujitsu Australia
Re: make MaxBackends available in _PG_init
On Mon, Sep 21, 2020 at 9:14 AM Wang, Shenhao wrote:
>
> In the source, I find that the postmaster will first load libraries, and then
> calculate the value of MaxBackends.
>
> In the old version, MaxBackends was calculated by:
>     MaxBackends = MaxConnections + autovacuum_max_workers + 1 +
>                   GetNumShmemAttachedBgworkers();
> Because any extension can register workers, which affects the return value
> of GetNumShmemAttachedBgworkers, InitializeMaxBackends must be called after
> shared_preload_libraries. This is also mentioned in the comments.
>
> Now, the function GetNumShmemAttachedBgworkers has been deleted and replaced by the
> GUC max_worker_processes, so what if we changed the calling order like:
> Step 1: call InitializeMaxBackends.
> Step 2: call process_shared_preload_libraries
>

Yes, GetNumShmemAttachedBgworkers() was removed by commit dfbba2c86cc8f09cf3ffca3d305b4ce54a7fb49a. AFAICS, changing the order of InitializeMaxBackends() and process_shared_preload_libraries() has no problem, as InitializeMaxBackends() doesn't calculate MaxBackends based on the bgworker infra code; it calculates it based on GUCs. Having said that, I'm not quite sure whether any of the bgworker registration code, or for that matter the process_shared_preload_libraries() code path, will somewhere use MaxBackends?

>
> In this order, extensions can get the correct value of MaxBackends in _PG_init.
>

Is there any specific use case where any _PG_init will use MaxBackends? I think the InitializeMaxBackends() function comments still say, as shown below:

 * This must be called after modules have had the chance to register background
 * workers in shared_preload_libraries, and before shared memory size is
 * determined.

What happens with your patch in EXEC_BACKEND cases? (On Linux, EXEC_BACKEND (include/pg_config_manual.h) can be enabled for testing purposes.)

With Regards,
Bharath Rupireddy.
EnterpriseDB: http://www.enterprisedb.com
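For reference, a sketch of the GUC-only calculation being referred to (roughly what InitializeMaxBackends() does these days; the exact terms may differ by branch):

    MaxBackends = MaxConnections + autovacuum_max_workers + 1 +
                  max_worker_processes + max_wal_senders;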
Re: [HACKERS] logical decoding of two-phase transactions
On Mon, Sep 21, 2020 at 5:23 PM Ajin Cherian wrote:
>
> On Mon, Sep 21, 2020 at 9:24 PM Amit Kapila wrote:
>
> > I think we need to call ReorderBufferAbort at least to clean up the
> > TXN. Also, if what you are saying is correct then that should be true
> > without this patch as well, no? If so, we don't need to worry about it
> > as far as this patch is concerned.
>
> Yes, that is true. So will change this check to:
>
> if (TransactionIdIsValid(xid) &&
>     ReorderBufferTxnIsPrepared(ctx->reorder, xid, parsed->twophase_gid))
>

Yeah, and add the Assert for the skip conditions as asked above.

--
With Regards,
Amit Kapila.
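Putting the two points together, the revised DecodeAbort path being discussed would look roughly like this (a sketch assembled from the snippets quoted above, not the final patch):

    if (TransactionIdIsValid(xid) &&
        ReorderBufferTxnIsPrepared(ctx->reorder, xid, parsed->twophase_gid))
    {
        /* These must already have held when the transaction was prepared. */
        Assert(!SnapBuildXactNeedsSkip(ctx->snapshot_builder, buf->origptr));
        Assert(parsed->dbId == ctx->slot->data.database);
        Assert(!FilterByOrigin(ctx, origin_id));

        ReorderBufferFinishPrepared(ctx->reorder, xid, buf->origptr, buf->endptr,
                                    commit_time, origin_id, origin_lsn,
                                    parsed->twophase_gid, false);
        return;
    }

    /* otherwise fall through to the regular ReorderBufferAbort() handling */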
Lift line-length limit for pg_service.conf
The pg_service.conf parsing thread [0] made me realize that we have a hardwired line length of max 256 bytes. Lifting this would be in line with recent work for ecpg, pg_regress and pg_hba (784b1ba1a2 and 8f8154a50).

The attached moves pg_service.conf parsing to the new pg_get_line_append API and a StringInfo to lift the restriction. Any reason not to do that while we're lifting other such limits?

cheers ./daniel

0001-Remove-arbitrary-line-length-in-pg_service.conf-pars.patch
Description: Binary data

[0] CABUevEw_RRRgG2uwsO7CD0TQf+Z=oR=S1=QyifV8D_5hatJ=o...@mail.gmail.com
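For illustration, the shape of the loop such a change implies (a sketch, assuming the two-argument pg_get_line_append(FILE *, StringInfo) form; the actual fe-connect.c parsing is more involved):

    StringInfoData linebuf;

    initStringInfo(&linebuf);

    while (pg_get_line_append(f, &linebuf))
    {
        /* strip trailing newline / carriage return */
        while (linebuf.len > 0 &&
               (linebuf.data[linebuf.len - 1] == '\n' ||
                linebuf.data[linebuf.len - 1] == '\r'))
            linebuf.data[--linebuf.len] = '\0';

        /* ... existing keyword=value parsing of linebuf.data goes here ... */

        resetStringInfo(&linebuf);
    }

    pfree(linebuf.data);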
Re: Yet another fast GiST build
On 21/09/2020 02:06, Tom Lane wrote:
> Justin Pryzby writes:
>> This also appears to break checksums.

Fixed, thanks for the report!

> I was wondering about that, because the typical pattern for use of
> smgrextend for indexes seems to be
>
>     RelationOpenSmgr(rel);
>     PageSetChecksumInplace(page, lastblock);
>     smgrextend(rel->rd_smgr, MAIN_FORKNUM, lastblock, zerobuf.data, false);
>
> and gist_indexsortbuild wasn't doing either of the first two things.
> gist_indexsortbuild_flush_ready_pages looks like it might be a few
> bricks shy of a load too. But my local CLOBBER_CACHE_ALWAYS run hasn't
> gotten to anything except the pretty-trivial index made in point.sql,
> so I don't have evidence about it.

I added a RelationOpenSmgr() call there too, although it's not needed currently. It seems to be enough to do it before the first smgrextend() call. But if you removed or refactored the first call somehow, so that it was not the first call anymore, it would be easy to miss that you'd still need the RelationOpenSmgr() call there. It's more consistent with the code in nbtsort.c now, too.

- Heikki
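For reference, an illustrative per-page pattern consistent with what is being described here (extend first, WAL-log afterwards, checksum right before the write); this is a sketch, not the actual gist_indexsortbuild_flush_ready_pages code, and the skipFsync choice depends on whether the caller fsyncs the relation at the end of the build:

    RelationOpenSmgr(rel);

    /* the page LSN was already stamped with GistBuildLSN during the build */
    PageSetChecksumInplace(page, blkno);
    smgrextend(rel->rd_smgr, MAIN_FORKNUM, blkno, (char *) page, true);

    /*
     * WAL-log after the extend, so an out-of-disk-space error is raised
     * before any WAL is written; unlike other index AMs, the record's LSN
     * is not copied back onto the page.
     */
    if (RelationNeedsWAL(rel))
        log_newpage(&rel->rd_node, MAIN_FORKNUM, blkno, page, true);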
Re: Logical replication from PG v13 and below to PG v14 (devel version) is not working.
On Mon, Sep 21, 2020 at 4:15 PM Dilip Kumar wrote: > > On Mon, Sep 21, 2020 at 3:26 PM Ashutosh Sharma wrote: > > > > Hi All, > > > > Today, while exploring logical replication in PostgreSQL, I noticed > > that logical replication from PG version 13 and below to PG v14 > > (development version) is not working. It has stopped working from the > > following git commit onwards: > > > > commit 464824323e57dc4b397e8b05854d779908b55304 > > Author: Amit Kapila > > Date: Thu Sep 3 07:54:07 2020 +0530 > > > > Add support for streaming to built-in logical replication. > > > > ... > > ... > > > > Here is the experiment that I performed to verify this: > > > > Publisher (PG-v12/13): > > == > > CREATE TABLE pg2pg (id serial PRIMARY KEY, whatever text); > > > > INSERT INTO pg2pg (whatever) SELECT 'str-' || i FROM generate_series(1, 5) > > i; > > > > SELECT * FROM pg2pg; > > > > CREATE PUBLICATION pg2pg_pub FOR TABLE pg2pg; > > > > Subscriber (PG-v14 HEAD or commit 46482432): > > = > > CREATE TABLE pg2pg (id serial PRIMARY KEY, whatever text); > > > > CREATE SUBSCRIPTION pg2pg_sub CONNECTION 'host=127.0.0.1 port=5433 > > dbname=postgres user=ashu' PUBLICATION pg2pg_pub; > > > > SELECT * FROM pg2pg; > > > > Above select query produces no result. When this experiment is > > performed below the mentioned git commit, it works fine. > > > > After spending some time looking into this issue, I observed that > > above git commit has bumped the logical replication protocol version. > > Due to this, the logical replication apply worker process is unable to > > do WAL streaming which causes it to terminate. Therefore, the data > > inserted in the publication table is not applied on the subscription > > table (i.e. no data replication happens) > > Seems like this commit, should have only set the > LOGICALREP_PROTO_STREAM_VERSION_NUM to 2 but the > LOGICALREP_PROTO_VERSION_NUM shouln't have been changed. I think if we don't increase the LOGICALREP_PROTO_VERSION_NUM, then streaming will not work in the latest version. So what I feel is that we can keep the LOGICALREP_PROTO_VERSION_NUM as 1 only add one more parameter say LOGICALREP_PROTO_MAX_VERSION_NUM. So now from the latest subscriber if streaming is enabled then we can send LOGICALREP_PROTO_STREAM_VERSION_NUM and LOGICALREP_PROTO_VERSION_NUM otherwise. And on publisher side we can check with the max_protocol_version and min_protocol version. I have attached a patch for the changes I have explained. -- Regards, Dilip Kumar EnterpriseDB: http://www.enterprisedb.com v1-0001-Bugfix-in-logical-protocol-version.patch Description: Binary data
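To make the proposal above concrete, the constant layout it describes would look something like this (an illustrative sketch of logicalproto.h, not necessarily the exact names and values in the attached patch):

    #define LOGICALREP_PROTO_MIN_VERSION_NUM     1
    #define LOGICALREP_PROTO_VERSION_NUM         1   /* base protocol, unchanged */
    #define LOGICALREP_PROTO_STREAM_VERSION_NUM  2   /* required for streaming */
    #define LOGICALREP_PROTO_MAX_VERSION_NUM     LOGICALREP_PROTO_STREAM_VERSION_NUM

with the apply worker requesting LOGICALREP_PROTO_STREAM_VERSION_NUM only when the subscription has streaming enabled, and the publisher validating the requested version against the MIN/MAX bounds.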
Re: Yet another fast GiST build
On 21/09/2020 12:06, Andrey M. Borodin wrote:
>> On 21 Sep 2020, at 13:45, Heikki Linnakangas wrote:
>>
>> Actually, don't we have a problem with that, even before this patch? Even
>> though we set the LSN to the magic GistBuildLSN value when we build the
>> index, WAL replay will write the LSN of the record instead. That would mess
>> with the LSN-NSN interlock. After WAL replay (or in a streaming replica), a
>> scan on the GiST index might traverse right-links unnecessarily.
>
> I think we don't set rightlinks during index build.

The new GiST sorting code does not, but the regular insert-based code does.

That's a bit questionable in the new code actually. Was that a conscious decision? The right-links are only needed when there are concurrent page splits, so I think it's OK, but the checks for InvalidBlockNumber in gistScanPage() and gistFindPage() have the comment "/* sanity check */". Comment changes are needed, at least.

- Heikki
Re: PATCH: Batch/pipelining support for libpq
Alvaro,

On Fri, 4 Sep 2020 at 17:26, Alvaro Herrera wrote:

> On 2020-Aug-31, Matthieu Garrigues wrote:
>
> > It seems like this patch is nearly finished. I fixed all the remaining
> > issues. I'm also asking a confirmation of the test scenarios you want
> > to see in the next version of the patch.
>
> Hmm, apparently nobody noticed that this patch is not registered in the
> current commitfest.
>
> Since it was submitted ahead of the deadline, I'm going to add it there.
> (The patch has also undergone a lot of review already; it's certainly
> not a newcomer.)

I am looking for this in the commitfest and can't find it. However there is an old commitfest entry:

https://commitfest.postgresql.org/13/1024/

Do you have the link for the new one?

Dave Cramer
www.postgres.rocks
Re: Logical replication from PG v13 and below to PG v14 (devel version) is not working.
Thanks Dilip for the patch. AFAIU, the fix looks good. One small comment: In the error message we are still referring to the native protocol version number. Shouldn't it be replaced with the greatest protocol version number we support now (i.e. LOGICALREP_PROTO_MAX_VERSION_NUM)? - if (data->protocol_version > LOGICALREP_PROTO_VERSION_NUM) + if (data->protocol_version > LOGICALREP_PROTO_MAX_VERSION_NUM) ereport(ERROR, (errcode(ERRCODE_FEATURE_NOT_SUPPORTED), errmsg("client sent proto_version=%d but we only support protocol %d or lower", data->protocol_version, LOGICALREP_PROTO_VERSION_NUM))); Other than this, I don't have any comments. On Mon, Sep 21, 2020 at 5:34 PM Dilip Kumar wrote: > > On Mon, Sep 21, 2020 at 4:15 PM Dilip Kumar wrote: > > > > On Mon, Sep 21, 2020 at 3:26 PM Ashutosh Sharma > > wrote: > > > > > > Hi All, > > > > > > Today, while exploring logical replication in PostgreSQL, I noticed > > > that logical replication from PG version 13 and below to PG v14 > > > (development version) is not working. It has stopped working from the > > > following git commit onwards: > > > > > > commit 464824323e57dc4b397e8b05854d779908b55304 > > > Author: Amit Kapila > > > Date: Thu Sep 3 07:54:07 2020 +0530 > > > > > > Add support for streaming to built-in logical replication. > > > > > > ... > > > ... > > > > > > Here is the experiment that I performed to verify this: > > > > > > Publisher (PG-v12/13): > > > == > > > CREATE TABLE pg2pg (id serial PRIMARY KEY, whatever text); > > > > > > INSERT INTO pg2pg (whatever) SELECT 'str-' || i FROM generate_series(1, > > > 5) i; > > > > > > SELECT * FROM pg2pg; > > > > > > CREATE PUBLICATION pg2pg_pub FOR TABLE pg2pg; > > > > > > Subscriber (PG-v14 HEAD or commit 46482432): > > > = > > > CREATE TABLE pg2pg (id serial PRIMARY KEY, whatever text); > > > > > > CREATE SUBSCRIPTION pg2pg_sub CONNECTION 'host=127.0.0.1 port=5433 > > > dbname=postgres user=ashu' PUBLICATION pg2pg_pub; > > > > > > SELECT * FROM pg2pg; > > > > > > Above select query produces no result. When this experiment is > > > performed below the mentioned git commit, it works fine. > > > > > > After spending some time looking into this issue, I observed that > > > above git commit has bumped the logical replication protocol version. > > > Due to this, the logical replication apply worker process is unable to > > > do WAL streaming which causes it to terminate. Therefore, the data > > > inserted in the publication table is not applied on the subscription > > > table (i.e. no data replication happens) > > > > Seems like this commit, should have only set the > > LOGICALREP_PROTO_STREAM_VERSION_NUM to 2 but the > > LOGICALREP_PROTO_VERSION_NUM shouln't have been changed. > > I think if we don't increase the LOGICALREP_PROTO_VERSION_NUM, then > streaming will not work in the latest version. So what I feel is that > we can keep the LOGICALREP_PROTO_VERSION_NUM as 1 only add one more > parameter say LOGICALREP_PROTO_MAX_VERSION_NUM. So now from the > latest subscriber if streaming is enabled then we can send > LOGICALREP_PROTO_STREAM_VERSION_NUM and LOGICALREP_PROTO_VERSION_NUM > otherwise. And on publisher side we can check with the > max_protocol_version and min_protocol version. I have attached a > patch for the changes I have explained. > > -- > Regards, > Dilip Kumar > EnterpriseDB: http://www.enterprisedb.com -- With Regards, Ashutosh Sharma EnterpriseDB:http://www.enterprisedb.com
Re: Dynamic gathering the values for seq_page_cost/xxx_cost
On Mon, Sep 21, 2020 at 9:11 AM Andy Fan wrote:
>
> Here are some changes for my detection program.
>
> |                                                  | seq_read_lat (us) | random_read_lat (us) |
> | FIO                                              |                12 |                  148 |
> | Previous main.c                                  |               8.5 |                   74 |
> | invalidate_device_cache before each testing      |                 9 |                  150 |
> | prepare the test data file with O_DIRECT option  |                15 |                  150 |
>
> In invalidate_device_cache, I just create another 1GB data file and read
> it (see the invalidate_device_cache function). This is similar to the previous
> fio setup.
>
> "Prepare the test data file with O_DIRECT option" means that in the past I prepared
> the test file with buffered IO, and before testing I invalidated the device cache and
> the file system cache. But the buffered prepared file still got better performance,
> and I have no idea why. Since I don't want any cache, I used the O_DIRECT option in
> the end. The seq_read_lat changed from 9us to 15us.
> I still can't find out the 25% difference from the FIO result (12 us vs 9 us).
>
> At last, the random_page_cost happens to not change very much.
>
> /u/y/g/fdirect> sudo ./main
> fs_cache_lat = 0.569031us, seq_read_lat = 18.901749us, random_page_lat = 148.650589us
>
> cache hit ratio: 1.00 random_page_cost 1.00
> cache hit ratio: 0.90 random_page_cost 6.401019
> cache hit ratio: 0.50 random_page_cost 7.663772
> cache hit ratio: 0.10 random_page_cost 7.841498
> cache hit ratio: 0.00 random_page_cost 7.864383
>
> This result looks much different from "we should use 1.1 ~ 1.5 for SSD".
>

Very interesting. Thanks for working on this. In an earlier email you mentioned that TPCH plans changed to efficient ones when you changed random_page_cost to 8.6 from 4 while seq_page_cost was set to 1. IIUC, setting the random_page_cost to seq_page_cost ratio to the same ratio as that between the corresponding latencies improved the plans. How about trying this with that ratio set to the one obtained from the latencies provided by FIO? Do we see any better plans?

Page cost is one thing, but there are also CPU costs involved in the costs of expression evaluation. Should those be changed accordingly to the CPU latency?

--
Best Wishes,
Ashutosh Bapat
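The numbers in the quoted run are consistent with blending the measured latencies by the assumed cache hit ratio and taking the random/sequential ratio. A small stand-alone sketch of that arithmetic (reconstructed from the output above, not the actual main.c):

    #include <stdio.h>

    /* latencies in microseconds, taken from the quoted run */
    #define CACHE_LAT   0.569031
    #define SEQ_LAT    18.901749
    #define RAND_LAT  148.650589

    static double
    effective_random_page_cost(double hit_ratio)
    {
        double seq_cost  = hit_ratio * CACHE_LAT + (1.0 - hit_ratio) * SEQ_LAT;
        double rand_cost = hit_ratio * CACHE_LAT + (1.0 - hit_ratio) * RAND_LAT;

        return rand_cost / seq_cost;
    }

    int
    main(void)
    {
        double ratios[] = {1.00, 0.90, 0.50, 0.10, 0.00};

        for (int i = 0; i < 5; i++)
            printf("cache hit ratio: %.2f random_page_cost %f\n",
                   ratios[i], effective_random_page_cost(ratios[i]));
        return 0;
    }

This reproduces the 1.00 / 6.40 / 7.66 / 7.84 / 7.86 figures shown above.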
Re: PATCH: Batch/pipelining support for libpq
Dave Cramer www.postgres.rocks On Mon, 31 Aug 2020 at 11:46, Matthieu Garrigues < matthieu.garrig...@gmail.com> wrote: > Hi, > > It seems like this patch is nearly finished. I fixed all the remaining > issues. I'm also asking > a confirmation of the test scenarios you want to see in the next > version of the patch. > > > Hi, > > > > On 2020-07-10 19:01:49 -0400, Alvaro Herrera wrote: > > > Totally unasked for, here's a rebase of this patch series. I didn't do > > > anything other than rebasing to current master, solving a couple of > very > > > trivial conflicts, fixing some whitespace complaints by git apply, and > > > running tests to verify everthing works. > > > > > > I don't foresee working on this at all, so if anyone is interested in > > > seeing this feature in, I encourage them to read and address > > > Horiguchi-san's feedback. > > > > Nor am I planning to do so, but I do think its a pretty important > > improvement. > > > > > Fixed > > > > > > > > +/* > > > + * PQrecyclePipelinedCommand > > > + * Push a command queue entry onto the freelist. It must be a > dangling entry > > > + * with null next pointer and not referenced by any other entry's > next pointer. > > > + */ > > > > Dangling sounds a bit like it's already freed. > > > > > Fixed > > > > > > > > +/* > > > + * PQbatchSendQueue > > > + * End a batch submission by sending a protocol sync. The connection > will > > > + * remain in batch mode and unavailable for new synchronous command > execution > > > + * functions until all results from the batch are processed by the > client. > > > > I feel like the reference to the protocol sync is a bit too low level > > for an external API. It should first document what the function does > > from a user's POV. > > > > I think it'd also be good to document whether / whether not queries can > > already have been sent before PQbatchSendQueue is called or not. > > > Fixed > > > > > > > > > > + if (conn->batch_status == PQBATCH_MODE_ABORTED && conn->queryclass > != PGQUERY_SYNC) > > > + { > > > + /* > > > + * In an aborted batch we don't get anything from the server for each > > > + * result; we're just discarding input until we get to the next sync > > > + * from the server. The client needs to know its queries got aborted > > > + * so we create a fake PGresult to return immediately from > > > + * PQgetResult. > > > + */ > > > + conn->result = PQmakeEmptyPGresult(conn, > > > + PGRES_BATCH_ABORTED); > > > + if (!conn->result) > > > + { > > > + printfPQExpBuffer(&conn->errorMessage, > > > + libpq_gettext("out of memory")); > > > + pqSaveErrorResult(conn); > > > + return 0; > > > > Is there any way an application can recover at this point? ISTM we'd be > > stuck in the previous asyncStatus, no? > > > > conn->result is null when malloc(sizeof(PGresult)) returns null. It's > very unlikely unless > the server machine is out of memory, so the server will probably be > unresponsive anyway. > > I'm leaving this as it is but if anyone has a solution simple to > implement I'll fix it. > > > > > > > > +/* pqBatchFlush > > > + * In batch mode, data will be flushed only when the out buffer > reaches the threshold value. > > > + * In non-batch mode, data will be flushed all the time. > > > + */ > > > +static int > > > +pqBatchFlush(PGconn *conn) > > > +{ > > > + if ((conn->batch_status == PQBATCH_MODE_OFF)||(conn->outCount >= > OUTBUFFER_THRESHOLD)) > > > + return(pqFlush(conn)); > > > + return 0; /* Just to keep compiler quiet */ > > > +} > > > > unnecessarily long line. 
> > > Fixed > > > > > > +/* > > > + * Connection's outbuffer threshold is set to 64k as it is safe > > > + * in Windows as per comments in pqSendSome() API. > > > + */ > > > +#define OUTBUFFER_THRESHOLD 65536 > > > > I don't think the comment explains much. It's fine to send more than 64k > > with pqSendSome(), they'll just be send with separate pgsecure_write() > > invocations. And only on windows. > > > > It clearly makes sense to start sending out data at a certain > > granularity to avoid needing unnecessary amounts of memory, and to make > > more efficient use of latency / serer side compute. > > > > It's not implausible that 64k is the right amount for that, I just don't > > think the explanation above is good. > > > > Fixed > > > > diff --git a/src/test/modules/test_libpq/testlibpqbatch.c > b/src/test/modules/test_libpq/testlibpqbatch.c > > > new file mode 100644 > > > index 00..4d6ba266e5 > > > --- /dev/null > > > +++ b/src/test/modules/test_libpq/testlibpqbatch.c > > > @@ -0,0 +1,1456 @@ > > > +/* > > > + * src/test/modules/test_libpq/testlibpqbatch.c > > > + * > > > + * > > > + * testlibpqbatch.c > > > + * Test of batch execution functionality > > > + */ > > > + > > > +#ifdef WIN32 > > > +#include > > > +#endif > > > > ISTM that this shouldn't be needed in a test program like this? > > Shouldn't libpq abstract all of this away? > > > > Fixed. > > > > > > +static void > > > +simple_batch(PGconn *conn) > > > +{ > > > > ISTM that
Re: Logical replication from PG v13 and below to PG v14 (devel version) is not working.
On Mon, Sep 21, 2020 at 6:27 PM Ashutosh Sharma wrote:
>
> Thanks Dilip for the patch. AFAIU, the fix looks good. One small comment:
>
> In the error message we are still referring to the native protocol
> version number. Shouldn't it be replaced with the greatest protocol
> version number we support now (i.e. LOGICALREP_PROTO_MAX_VERSION_NUM)?
>
> -   if (data->protocol_version > LOGICALREP_PROTO_VERSION_NUM)
> +   if (data->protocol_version > LOGICALREP_PROTO_MAX_VERSION_NUM)
>         ereport(ERROR,
>                 (errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
>                  errmsg("client sent proto_version=%d but we only support protocol %d or lower",
>                         data->protocol_version, LOGICALREP_PROTO_VERSION_NUM)));
>
> Other than this, I don't have any comments.

Thanks for the review. I have fixed it in the attached patch.

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com

v2-0001-Bugfix-in-logical-protocol-version.patch
Description: Binary data
Re: PATCH: Batch/pipelining support for libpq
On 2020-Sep-21, Dave Cramer wrote:

Hello Dave,

> I am looking for this in the commitfest and can't find it. However there is
> an old commitfest entry
>
> https://commitfest.postgresql.org/13/1024/
>
> Do you have the link for the new one ?

Here you go: https://commitfest.postgresql.org/29/2724/

--
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Re: PATCH: Batch/pipelining support for libpq
Matthieu Garrigues

On Mon, Sep 21, 2020 at 3:09 PM Dave Cramer wrote:
>
> There was a comment upthread a while back that people should look at the
> comments made in
> https://www.postgresql.org/message-id/20180322.211148.187821341.horiguchi.kyotaro%40lab.ntt.co.jp
> by Horiguchi-San.
>
> From what I can tell this has not been addressed. The one big thing is the
> use of PQbatchProcessQueue vs just putting it in PQgetResult.
>
> The argument is that adding PQbatchProcessQueue is unnecessary and just adds
> another step. Looking at this, it seems like putting this inside PQgetResult
> would get my vote as it leaves the interface unchanged.
>

Ok. I'll merge PQbatchProcessQueue into PQgetResult. But just one thing: I'll keep PQgetResult returning null between the results of two batched queries, so the user can know which result comes from which query.
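For illustration, the client-side consequence of that choice (a sketch; apart from PQgetResult/PQclear the surrounding batch API comes from the patch, and the two-query count is just an example):

    /* after two queries have been queued in batch mode ... */
    for (int query = 0; query < 2; query++)
    {
        PGresult   *res;

        while ((res = PQgetResult(conn)) != NULL)
        {
            /* process one result belonging to the current query */
            PQclear(res);
        }

        /* the NULL marks the boundary between the two queries' results */
    }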
Re: Logical replication from PG v13 and below to PG v14 (devel version) is not working.
On Mon, Sep 21, 2020 at 6:27 PM Ashutosh Sharma wrote:
>
> Thanks Dilip for the patch. AFAIU, the fix looks good. One small comment:
>

Thanks Ashutosh and Dilip for working on this. I'll look into it in a day or two.

--
With Regards,
Amit Kapila.
Re: Yet another fast GiST build
> On 21 Sep 2020, at 17:15, Heikki Linnakangas wrote:
>
> On 21/09/2020 12:06, Andrey M. Borodin wrote:
>>
>> I think we don't set rightlinks during index build.
>
> The new GiST sorting code does not, but the regular insert-based code does.
>
> That's a bit questionable in the new code actually. Was that a conscious
> decision? The right-links are only needed when there are concurrent page
> splits, so I think it's OK, but the checks for InvalidBlockNumber in
> gistScanPage() and gistFindPage() have comment "/* sanity check */".
> Comment changes are needed, at least.

It was a conscious decision with incorrect motivation. I was thinking that it would help to reduce the number of "false positive" inspections of right pages. But now I see that:
1. There should be no such "false positives" that we can avoid
2. Valid rightlinks could help us do amcheck verification in the future

But a thing that bothers me now: when we vacuum a leaf page, we bump its NSN. But we do not bump the internal page LSN. Does this mean we will follow rightlinks after vacuum? It seems superfluous. And btw, we did not adjust internal page tuples after vacuum either...

Best regards, Andrey Borodin.
Re: Inconsistent Japanese name order in v13 contributors list
On 2020-Sep-18, Bruce Momjian wrote:

> This thread from 2015 is the most comprehensive discussion I remember of
> Japanese name ordering, including a suggestion to use small caps:
>
> https://www.postgresql.org/message-id/flat/20150613231826.GY133018%40postgresql.org#88d245a5cdd2b32e1e3e80fc07eab6f2
>
> I have been following this guidance ever since.

Right. About smallcaps, we didn't do it then because there was no way known to us to use them in our then-current toolchain. But we've changed now to XML and apparently it is possible to use them -- we could try something like and define a CSS rule like

.caps_lastname {font-variant: small-caps;}

(Apparently you also need to set 'emphasis.propagates.style' to 1 in the stylesheet). This does it for HTML. You also need some FO trick to cover the PDF (probably something like what a042750646db did to change catalog_table_entry formatting.)

--
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Re: PATCH: Batch/pipelining support for libpq
On Mon, 21 Sep 2020 at 09:21, Matthieu Garrigues < matthieu.garrig...@gmail.com> wrote: > Matthieu Garrigues > > On Mon, Sep 21, 2020 at 3:09 PM Dave Cramer > wrote: > >> > > There was a comment upthread a while back that people should look at the > comments made in > https://www.postgresql.org/message-id/20180322.211148.187821341.horiguchi.kyotaro%40lab.ntt.co.jp > by Horiguchi-San. > > > > From what I can tell this has not been addressed. The one big thing is > the use of PQbatchProcessQueue vs just putting it in PQgetResult. > > > > The argument is that adding PQbatchProcessQueue is unnecessary and just > adds another step. Looking at this, it seems like putting this inside > PQgetResult would get my vote as it leaves the interface unchanged. > > > > Ok. I'll merge PQbatchProcessQueue into PQgetResult. But just one > thing: I'll keep PQgetResult returning null between the result of two > batched query so the user > can know which result comes from which query. > Fair enough. There may be other things in his comments that need to be addressed. That was the big one that stuck out for me. Thanks for working on this! Dave Cramer www.postgres.rocks
Re: Command statistics system (cmdstats)
On 2020-Sep-18, Michael Paquier wrote:

> Based on the low level of activity and the fact that the patch was
> marked as waiting on author for a couple of weeks, it looks like
> little could be achieved by the end of the CF, and the attention was
> elsewhere, so it looked better (and it still does IMO) to give more
> attention to the remaining 170~ patches that are still lying on the CF
> app.

"A couple of weeks" of inactivity is not sufficient, in my view, to boot a patch out of the commitfest process. Whenever the patch is resurrected, it will be a new entry which won't have the history that it had accumulated in the long time since it was created -- which biases it against other new patches.

I'm not really arguing about this one patch only, but more generally about the process you're following. The problem is that you as CFM are imposing your personal priorities on the whole process by punting patches ahead of time -- you are saying that nobody else should be looking at updating this patch because it is now closed. It seems fair to do so at the end of the commitfest, but it does not seem fair to do it when it's still mid-cycle.

Putting also in perspective the history that the patch had prior to your unfairly closing it, there was a lot of effort in getting it reviewed to a good state and updated by the author -- yet a single unsubstantiated comment, without itself a lot of effort on the reviewer's part, that "maybe it has a perf drawback" is sufficient to get it booted? It doesn't seem appropriate.

--
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Re: PATCH: Batch/pipelining support for libpq
On Mon, Sep 21, 2020 at 3:39 PM Dave Cramer wrote: > > > > On Mon, 21 Sep 2020 at 09:21, Matthieu Garrigues > wrote: >> >> Matthieu Garrigues >> >> On Mon, Sep 21, 2020 at 3:09 PM Dave Cramer >> wrote: >> >> >> > There was a comment upthread a while back that people should look at the >> > comments made in >> > https://www.postgresql.org/message-id/20180322.211148.187821341.horiguchi.kyotaro%40lab.ntt.co.jp >> > by Horiguchi-San. >> > >> > From what I can tell this has not been addressed. The one big thing is the >> > use of PQbatchProcessQueue vs just putting it in PQgetResult. >> > >> > The argument is that adding PQbatchProcessQueue is unnecessary and just >> > adds another step. Looking at this, it seems like putting this inside >> > PQgetResult would get my vote as it leaves the interface unchanged. >> > >> >> Ok. I'll merge PQbatchProcessQueue into PQgetResult. But just one >> thing: I'll keep PQgetResult returning null between the result of two >> batched query so the user >> can know which result comes from which query. > > > Fair enough. > > There may be other things in his comments that need to be addressed. That was > the big one that stuck out for me. > > Thanks for working on this! > Yes I already addressed the other things in the v19 patch: https://www.postgresql.org/message-id/flat/cajkzx4t5e-2cqe3dtv2r78dyfvz+in8py7a8marvlhs_pg7...@mail.gmail.com
Re: Yet another fast GiST build
> On 21 Sep 2020, at 18:29, Andrey M. Borodin wrote:
>
> It was a conscious decision with incorrect motivation. I was thinking that it
> would help to reduce the number of "false positive" inspections of right pages.
> But now I see that:
> 1. There should be no such "false positives" that we can avoid
> 2. Valid rightlinks could help us do amcheck verification in the future

Well, point number 2 here is invalid. There exists one leaf page p such that, if we start traversing rightlinks from p, we will reach all leaf pages. But we practically have no means to find this page. This makes rightlinks not very helpful for amcheck on GiST. But for consistency I think it is worth installing them.

Thanks!

Best regards, Andrey Borodin.

0001-Install-rightlinks-on-GiST-pages-in-case-of-sorting-.patch
Description: Binary data
RE: Binaries on s390x arch
Hi Christoph,

Thank you for your response.

We will be glad to obtain binaries for s390x on the RHEL, SLES and Ubuntu distros. We are ready to provide the necessary infra. To enable this, we would need more information about the VM configuration (number of VMs, OS, vCPUs, memory, storage). Secondly, T&Cs would need to be signed electronically.

Please let me know if you are OK to proceed and we can communicate further.

Thanks and Regards,
Namrata

From: Christoph Berg
To: Namrata Bhave
Cc: pgsql-hackers@lists.postgresql.org, Prasanna Kelkar
Date: 09/21/2020 02:31 PM
Subject: [EXTERNAL] Re: Binaries on s390x arch

Re: Namrata Bhave
> As seen from downloads page, the Apt repo/rpms are not yet available for
> s390x for latest versions.

Hi,

are you asking about apt (.deb) or yum (.rpm) packages?

> Wanted to know if there is any work going on/planned to provide Postgres in
> ready-to-use package or installer form for s390x architecture?

s390x is fully supported as a Debian release architecture, so you can get PostgreSQL and all packaged modules there (PG11 on Debian buster). The apt team doesn't have plans for s390x yet on the PostgreSQL side. If you are interested, are you in a position to donate a build host?

Christoph
Re: Yet another fast GiST build
Heikki Linnakangas writes: > On 21/09/2020 02:06, Tom Lane wrote: >> Another interesting point is that all the other index AMs seem to WAL-log >> the new page before the smgrextend call, whereas this code is doing it >> in the other order. I strongly doubt that both patterns are equally >> correct. Could be that the other AMs are in the wrong though. > My thinking was that it's better to call smgrextend() first, so that if > you run out of disk space, you get the error before WAL-logging it. That > reduces the chance that WAL replay will run out of disk space. A lot of > things are different during WAL replay, so it's quite likely that WAL > replay runs out of disk space anyway if you're living on the edge, but > still. Yeah. access/transam/README points out that such failures need to be planned for, and explains what we do for heap pages; 1. Adding a disk page to an existing table. This action isn't WAL-logged at all. We extend a table by writing a page of zeroes at its end. We must actually do this write so that we are sure the filesystem has allocated the space. If the write fails we can just error out normally. Once the space is known allocated, we can initialize and fill the page via one or more normal WAL-logged actions. Because it's possible that we crash between extending the file and writing out the WAL entries, we have to treat discovery of an all-zeroes page in a table or index as being a non-error condition. In such cases we can just reclaim the space for re-use. So GIST seems to be acting according to that design. (Someday we need to update this para to acknowledge that not all filesystems behave as it's assuming.) > I didn't notice that the other callers are doing it the other way round, > though. I think they need to, so that they can stamp the page with the > LSN of the WAL record. But GiST build is special in that regard, because > it stamps all pages with GistBuildLSN. Kind of unpleasant; that means they risk what the README points out: In all of these cases, if WAL replay fails to redo the original action we must panic and abort recovery. The DBA will have to manually clean up (for instance, free up some disk space or fix directory permissions) and then restart recovery. This is part of the reason for not writing a WAL entry until we've successfully done the original action. I'm not sufficiently motivated to go and change it right now, though. regards, tom lane
Re: Global snapshots
On 2020-09-18 00:54, Bruce Momjian wrote: On Tue, Sep 8, 2020 at 01:36:16PM +0300, Alexey Kondratov wrote: Thank you for the link! After a quick look at Sawada-san's patch set I think that there are two major differences: 1. There is a built-in foreign xacts resolver in [1], which should be much more convenient from the end-user perspective. It involves huge in-core changes and additional complexity, which is of course worth it. However, it's still not clear to me that it is possible to resolve all foreign prepared xacts on Postgres' own side with a 100% guarantee. Imagine a situation where the coordinator node is actually an HA cluster group (primary + sync + async replica) and it fails just after the PREPARE stage or after the local COMMIT. In that case all foreign xacts will be left in the prepared state. After the failover process completes, the synchronous replica will become the new primary. Would it have all the required info to properly resolve orphan prepared xacts? Probably this situation is handled properly in [1], but I've not yet finished a thorough reading of the patch set, though it has a great doc! On the other hand, the previous 0003 and my proposed patch rely on either manual resolution of hung prepared xacts or usage of an external monitor/resolver. This approach is much simpler from the in-core perspective, but doesn't look as complete as [1] though. Have we considered how someone would clean up foreign transactions if the coordinating server dies? Could it be done manually? Would an external resolver, rather than an internal one, make this easier? Both Sawada-san's patch [1] and the patches in this thread (e.g. mine [2]) use 2PC with a special gid format including an xid + server identification info. Thus, one can select from pg_prepared_xacts, get the xid and coordinator info, then use txid_status() on the coordinator (or ex-coordinator) to get the transaction status, and finally either commit or abort these stale prepared xacts. Of course this could be wrapped into some user-level support routines, as is done in [1]. As for the benefits of using an external resolver, I think that there are some from the whole-system perspective: 1) If one follows the logic above, then this resolver could be stateless: it takes all the required info from the Postgres nodes themselves. 2) You can easily put it into a container, which makes it easier to deploy to all this 'cloud' stuff like Kubernetes. 3) Also, you can scale resolvers independently of the Postgres nodes. I do not think that either of these points is a game changer, but we use a very simple external resolver together with [2] in our sharding prototype and it works just fine so far. [1] https://www.postgresql.org/message-id/CA%2Bfd4k4HOVqqC5QR4H984qvD0Ca9g%3D1oLYdrJT_18zP9t%2BUsJg%40mail.gmail.com [2] https://www.postgresql.org/message-id/3ef7877bfed0582019eab3d462a43275%40postgrespro.ru -- Alexey Kondratov Postgres Professional https://www.postgrespro.com Russian Postgres Company
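To make the manual-resolution procedure above concrete, here is a rough SQL sketch. The gid format and the xid embedded in it depend on the patches being discussed, so the literal xid and gid below are placeholders; only pg_prepared_xacts, txid_status() and COMMIT/ROLLBACK PREPARED are stock PostgreSQL.

    -- On a participant node: find stale prepared transactions.
    SELECT gid, prepared, database FROM pg_prepared_xacts;

    -- On the (ex-)coordinator: check the fate of the transaction whose xid was
    -- parsed out of the gid (txid_status() takes the 64-bit, epoch-qualified txid).
    SELECT txid_status(1234567);   -- 'committed', 'aborted', 'in progress' or NULL

    -- Back on the participant: finish the orphan accordingly.
    COMMIT PREPARED 'the-gid-from-above';
    -- or
    ROLLBACK PREPARED 'the-gid-from-above';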
Re: recovering from "found xmin ... from before relfrozenxid ..."
Amit Kapila writes: > On Sun, Sep 20, 2020 at 10:43 PM Tom Lane wrote: >> AFAICS, there is no chance of the existing pg_surgery regression test >> being fully stable if we don't fix both things. > What if ensure that it runs with autovacuum = off and there is no > parallel test running? I am not sure about the second part but if we > can do that then the test will be probably stable. Then it'll not be usable under "make installcheck", which is not very nice. It's also arguable that you aren't testing pg_surgery under real-world conditions if you do it like that. Moreover, I think that both of these points need to be addressed anyway, as they represent bugs that are reachable independently of pg_surgery. Admittedly, we do not have a test case that proves that the inconsistency between pruneheap and vacuum has any bad effects in the absence of a7212be8b. But do you really want to bet that there are none? regards, tom lane
PGXS testing upgrade paths
If there's a better list than this, please let me know, but I figured hackers is appropriate since the extension building infrastructure is documented in core. While working on an in-house extension I realized that while PGXS provides the standard regression test infrastructure, I'm not aware of an automatic or standard way to test all upgrade paths provided by the extension. Is there something I'm missing? Or an approach people like (I suppose the simplest way would be "manually" executing the upgrade files in a regress test, but that seems tedious). Thanks, James
Re: Lift line-length limit for pg_service.conf
Daniel Gustafsson writes: > The pg_service.conf parsing thread [0] made me realize that we have a > hardwired > line length of max 256 bytes. Lifting this would be in line with recent work > for ecpg, pg_regress and pg_hba (784b1ba1a2 and 8f8154a50). The attached > moves > pg_service.conf to use the new pg_get_line_append API and a StringInfo to lift > the restriction. Any reason not to do that while we're lifting other such > limits? +1. I'd been thinking of looking around at our fgets calls to see which ones need this sort of work, but didn't get to it yet. Personally, I'd avoid depending on StringInfo.cursor here, as the dependency isn't really buying you anything. Instead you could do line = linebuf.data; just after the trim-trailing-whitespace loop and then leave the "ignore leading whitespace too" code as it stands. Also, the need for inserting the pfree into multiple exit paths kind of makes me itch. I wonder if we should change the ending code to look like exit: fclose(f); pfree(linebuf.data); return result; and then the early exit spots would be replaced with "result = x; goto exit". (Some of them could use "break", but not all, so it's probably better to consistently use "goto".) regards, tom lane
Re: PGXS testing upgrade paths
James Coleman writes: > If there's a better list than this, please let me know, but I figured > hackers is appropriate since the extension building infrastructure is > documented in core. > While working on an in-house extension I realized that while PGXS > provides the standard regression test infrastructure, I'm not aware of > an automatic or standard way to test all upgrade paths provided by the > extension. The recommended way to deal with updates these days is to leave the original extension script as-is and just write update scripts (1.0--1.1, 1.1--1.2, etc). That way, application of the updates is tested automatically every time you do CREATE EXTENSION. Now, if you also want to check that the intermediate states still behave as intended, I don't see much of a solution that doesn't involve custom test scaffolding. regards, tom lane
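As an illustration of that workflow, assume a hypothetical extension myext with scripts myext--1.0.sql, myext--1.0--1.1.sql and myext--1.1--1.2.sql, and default_version = '1.2' in its control file (all names here are made up):

    -- With no myext--1.2.sql present, this installs 1.0 and then applies the
    -- 1.0--1.1 and 1.1--1.2 update scripts, so the update chain is exercised
    -- on every install:
    CREATE EXTENSION myext;
    SELECT * FROM pg_extension_update_paths('myext');   -- inspect the known paths
    DROP EXTENSION myext;

    -- A regression test can also start from an old version and update explicitly:
    CREATE EXTENSION myext VERSION '1.0';
    ALTER EXTENSION myext UPDATE TO '1.2';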
Re: PGXS testing upgrade paths
On Mon, Sep 21, 2020 at 11:36 AM Tom Lane wrote: > > James Coleman writes: > > If there's a better list than this, please let me know, but I figured > > hackers is appropriate since the extension building infrastructure is > > documented in core. > > > While working on an in-house extension I realized that while PGXS > > provides the standard regression test infrastructure, I'm not aware of > > an automatic or standard way to test all upgrade paths provided by the > > extension. > > The recommended way to deal with updates these days is to leave the > original extension script as-is and just write update scripts > (1.0--1.1, 1.1--1.2, etc). That way, application of the updates > is tested automatically every time you do CREATE EXTENSION. Ah, so just don't add a new 1.2 file, etc. That also implies not having more direct upgrade paths (e.g., 1.0--1.2)? > Now, if you also want to check that the intermediate states still > behave as intended, I don't see much of a solution that doesn't > involve custom test scaffolding. Yeah, I'm not so much concerned about intermediate states so much as multiple upgrade paths and/or multiple single-version install files (which you replied to already above). James
Re: PGXS testing upgrade paths
James Coleman writes: > On Mon, Sep 21, 2020 at 11:36 AM Tom Lane wrote: >> The recommended way to deal with updates these days is to leave the >> original extension script as-is and just write update scripts >> (1.0--1.1, 1.1--1.2, etc). That way, application of the updates >> is tested automatically every time you do CREATE EXTENSION. > Ah, so just don't add a new 1.2 file, etc. > That also implies not having more direct upgrade paths (e.g., 1.0--1.2)? Right. Once the accumulation of cruft starts to impact install time substantially, maybe you want to roll things up to a new base version. But at least with the contrib modules we've found that it's seldom worth the trouble. regards, tom lane
Re: Planner, check if can use consider HASH for groupings (src/backend/optimizer/plan/planner.c)
On Mon, Sep 21, 2020 at 08:32:35AM -0300, Ranier Vilela wrote: Em dom., 20 de set. de 2020 às 21:10, Tomas Vondra < tomas.von...@2ndquadrant.com> escreveu: On Sun, Sep 20, 2020 at 08:09:56PM -0300, Ranier Vilela wrote: >Em sex., 18 de set. de 2020 às 10:37, Tomas Vondra < >tomas.von...@2ndquadrant.com> escreveu: > >> On Thu, Sep 17, 2020 at 06:31:12PM -0300, Ranier Vilela wrote: >> >Em qui., 17 de set. de 2020 às 17:09, Tomas Vondra < >> >tomas.von...@2ndquadrant.com> escreveu: >> > >> >> On Thu, Sep 17, 2020 at 02:12:12PM -0300, Ranier Vilela wrote: >> >> >Hi, >> >> > >> >> >In case gd->any_hashable is FALSE, grouping_is_hashable is never >> called. >> >> >In this case, the planner could use HASH for groupings, but will never >> >> know. >> >> > >> >> >> >> The condition is pretty simple - if the query has grouping sets, look at >> >> grouping sets, otherwise look at groupClause. For this to be an issue >> >> the query would need to have both grouping sets and (independent) group >> >> clause - which AFAIK is not possible. >> >> >> >Uh? >> >(parse->groupClause != NIL) If different from NIL we have ((independent) >> >group clause), grouping_is_hashable should check? >> >(gd ? gd->any_hashable : grouping_is_hashable(parse->groupClause >> >If gd is not NIL and gd->any_hashtable is FALSE? >> > >> >> Sorry, I'm not quite sure I understand what this is meant to say :-( >> >> Anyway, (groupClause != NIL) does not mean the groupClause is somehow >> independent (of what?). Add some debugging to create_grouping_paths and >> you'll see that e.g. this query ends up with groupClause with 3 items: >> >> select 1 from t group by grouping sets ((a), (b), (c)); >> >> and so does this one: >> >> select 1 from t group by grouping sets ((a,c), (a,b)); >> >> I'm no expert in grouping sets, but I'd bet this means we transform the >> grouping sets into a groupClause by extracting the keys. I haven't >> investigated why exactly we do this, but I'm sure there's a reason (e.g. >> it gives us SortGroupClause). >> >> You seem to believe a query can have both grouping sets and regular >> grouping at the same level - but how would such query look like? Surely >> you can't have two GROuP BY clauses. You can do >> >Translating into code: >gd is grouping sets and >parse->groupClause is regular grouping >and we cannot have both at the same time. > Have you verified the claim that we can't have both at the same time? As I already explained, we build groupClause from grouping sets. I suggest you do some debugging using the queries I used as examples, and you'll see the claim is not true. I think we already agreed that we cannot have both at the same time. No, we haven't. I think we've agreed that you can't have both a group by clause with grouping sets and then another group by clause with "plain" grouping. But even with grouping sets you'll have groupClause generated from the grouping sets. See preprocess_grouping_sets. > >> select 1 from t group by a, grouping sets ((b), (c)); >> >> which is however mostly equivalent to (AFAICS) >> >> select 1 from t group by grouping sets ((a,b), (a,c)) >> >> so it's not like we have an independent groupClause either I think. >> >> The whole point of the original condition is - if there are grouping >> sets, check if at least one can be executed using hashing (i.e. all keys >> are hashable). Otherwise (without grouping sets) look at the grouping as >> a whole. >> >Or if gd is NULL check parse->groupClause. > Which is what I'm saying. 
> >> I don't see how your change improves this - if there are grouping sets, >> it's futile to look at the whole groupClause if at least one grouping >> set can't be hashed. >> >> But there's a simple way to disprove this - show us a case (table and a >> query) where your code correctly allows hashing while the current one >> does not. > >I'm not an expert in grouping either. >The question I have here is whether gd is populated and has gd-> >any_hashable as FALSE, >Its mean no use checking parse-> groupClause, it's a waste of time, Ok. > I'm sorry, I don't follow your logic. Those are two separate cases. If we have grouping sets, we have to check if at least one can be hashed. If we don't have grouping sets, we have to check groupClause directly. Why would that be a waste of time is unclear to me. This is clear to me. The problem is how it was implemented in create_grouping_paths. You haven't demonstrated what the problem is, though. Showing us a query that fails would make it very clear. > >> >> > >> >> For hashing to be worth considering, at least one grouping set has to be >> >> hashable - otherwise it's pointless. >> >> >> >> Granted, you could have something like >> >> >> >> GROUP BY GROUPING SETS ((a), (b)), c >> >> >> >> which I think essentially says "add c to every grouping set" and that >> >> will be covered by the any_hashable check. >> >> >> >I am not going into the merit of whether or not it is worth hash
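For anyone following along, one low-tech way to see which strategy the planner actually ends up with for a grouping-sets query is to look at the aggregate node in EXPLAIN. A minimal sketch, assuming a scratch table t with ordinary hashable integer columns (plan shapes will of course vary with data, statistics and enable_* settings):

    CREATE TABLE t (a int, b int, c int);
    SET enable_sort = off;   -- nudge the planner toward hashing where it is allowed
    EXPLAIN (COSTS OFF)
    SELECT count(*) FROM t GROUP BY GROUPING SETS ((a), (b), (c));
    -- A HashAggregate or MixedAggregate node means the hashability check let
    -- hashing through; a plain GroupAggregate/Sort means it did not, or the
    -- planner simply preferred sorting.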
Re: doc review for v13
Justin Pryzby writes: > Thanks. Here's the remainder, with some new ones. LGTM. I tweaked one or two places a bit more, and pushed it. regards, tom lane
Re: Planner, check if can use consider HASH for groupings (src/backend/optimizer/plan/planner.c)
Em seg., 21 de set. de 2020 às 13:37, Tomas Vondra < tomas.von...@2ndquadrant.com> escreveu: > On Mon, Sep 21, 2020 at 08:32:35AM -0300, Ranier Vilela wrote: > >Em dom., 20 de set. de 2020 às 21:10, Tomas Vondra < > >tomas.von...@2ndquadrant.com> escreveu: > > > >> On Sun, Sep 20, 2020 at 08:09:56PM -0300, Ranier Vilela wrote: > >> >Em sex., 18 de set. de 2020 às 10:37, Tomas Vondra < > >> >tomas.von...@2ndquadrant.com> escreveu: > >> > > >> >> On Thu, Sep 17, 2020 at 06:31:12PM -0300, Ranier Vilela wrote: > >> >> >Em qui., 17 de set. de 2020 às 17:09, Tomas Vondra < > >> >> >tomas.von...@2ndquadrant.com> escreveu: > >> >> > > >> >> >> On Thu, Sep 17, 2020 at 02:12:12PM -0300, Ranier Vilela wrote: > >> >> >> >Hi, > >> >> >> > > >> >> >> >In case gd->any_hashable is FALSE, grouping_is_hashable is never > >> >> called. > >> >> >> >In this case, the planner could use HASH for groupings, but will > >> never > >> >> >> know. > >> >> >> > > >> >> >> > >> >> >> The condition is pretty simple - if the query has grouping sets, > >> look at > >> >> >> grouping sets, otherwise look at groupClause. For this to be an > issue > >> >> >> the query would need to have both grouping sets and (independent) > >> group > >> >> >> clause - which AFAIK is not possible. > >> >> >> > >> >> >Uh? > >> >> >(parse->groupClause != NIL) If different from NIL we have > >> ((independent) > >> >> >group clause), grouping_is_hashable should check? > >> >> >(gd ? gd->any_hashable : grouping_is_hashable(parse->groupClause > >> >> >If gd is not NIL and gd->any_hashtable is FALSE? > >> >> > > >> >> > >> >> Sorry, I'm not quite sure I understand what this is meant to say :-( > >> >> > >> >> Anyway, (groupClause != NIL) does not mean the groupClause is somehow > >> >> independent (of what?). Add some debugging to create_grouping_paths > and > >> >> you'll see that e.g. this query ends up with groupClause with 3 > items: > >> >> > >> >> select 1 from t group by grouping sets ((a), (b), (c)); > >> >> > >> >> and so does this one: > >> >> > >> >> select 1 from t group by grouping sets ((a,c), (a,b)); > >> >> > >> >> I'm no expert in grouping sets, but I'd bet this means we transform > the > >> >> grouping sets into a groupClause by extracting the keys. I haven't > >> >> investigated why exactly we do this, but I'm sure there's a reason > (e.g. > >> >> it gives us SortGroupClause). > >> >> > >> >> You seem to believe a query can have both grouping sets and regular > >> >> grouping at the same level - but how would such query look like? > Surely > >> >> you can't have two GROuP BY clauses. You can do > >> >> > >> >Translating into code: > >> >gd is grouping sets and > >> >parse->groupClause is regular grouping > >> >and we cannot have both at the same time. > >> > > >> > >> Have you verified the claim that we can't have both at the same time? As > >> I already explained, we build groupClause from grouping sets. I suggest > >> you do some debugging using the queries I used as examples, and you'll > >> see the claim is not true. > >> > >I think we already agreed that we cannot have both at the same time. > > > > No, we haven't. I think we've agreed that you can't have both a group by > clause with grouping sets and then another group by clause with "plain" > grouping. But even with grouping sets you'll have groupClause generated > from the grouping sets. See preprocess_grouping_sets. 
> > > > >> > >> > > >> >> select 1 from t group by a, grouping sets ((b), (c)); > >> >> > >> >> which is however mostly equivalent to (AFAICS) > >> >> > >> >> select 1 from t group by grouping sets ((a,b), (a,c)) > >> >> > >> >> so it's not like we have an independent groupClause either I think. > >> >> > >> >> The whole point of the original condition is - if there are grouping > >> >> sets, check if at least one can be executed using hashing (i.e. all > keys > >> >> are hashable). Otherwise (without grouping sets) look at the > grouping as > >> >> a whole. > >> >> > >> >Or if gd is NULL check parse->groupClause. > >> > > >> > >> Which is what I'm saying. > >> > >> > > >> >> I don't see how your change improves this - if there are grouping > sets, > >> >> it's futile to look at the whole groupClause if at least one grouping > >> >> set can't be hashed. > >> >> > >> >> But there's a simple way to disprove this - show us a case (table > and a > >> >> query) where your code correctly allows hashing while the current one > >> >> does not. > >> > > >> >I'm not an expert in grouping either. > >> >The question I have here is whether gd is populated and has gd-> > >> >any_hashable as FALSE, > >> >Its mean no use checking parse-> groupClause, it's a waste of time, Ok. > >> > > >> > >> I'm sorry, I don't follow your logic. Those are two separate cases. If > >> we have grouping sets, we have to check if at least one can be hashed. > >> If we don't have grouping sets, we have to check groupClause directly. > >> Why would that be a waste of time is unclear to
Re: Planner, check if can use consider HASH for groupings (src/backend/optimizer/plan/planner.c)
On Sun, Sep 20, 2020 at 01:50:34PM -0300, Ranier Vilela wrote: Em seg., 21 de set. de 2020 às 13:37, Tomas Vondra < tomas.von...@2ndquadrant.com> escreveu: On Mon, Sep 21, 2020 at 08:32:35AM -0300, Ranier Vilela wrote: >Em dom., 20 de set. de 2020 às 21:10, Tomas Vondra < >tomas.von...@2ndquadrant.com> escreveu: > >> On Sun, Sep 20, 2020 at 08:09:56PM -0300, Ranier Vilela wrote: >> >Em sex., 18 de set. de 2020 às 10:37, Tomas Vondra < >> >tomas.von...@2ndquadrant.com> escreveu: >> > >> >> On Thu, Sep 17, 2020 at 06:31:12PM -0300, Ranier Vilela wrote: >> >> >Em qui., 17 de set. de 2020 às 17:09, Tomas Vondra < >> >> >tomas.von...@2ndquadrant.com> escreveu: >> >> > >> >> >> On Thu, Sep 17, 2020 at 02:12:12PM -0300, Ranier Vilela wrote: >> >> >> >Hi, >> >> >> > >> >> >> >In case gd->any_hashable is FALSE, grouping_is_hashable is never >> >> called. >> >> >> >In this case, the planner could use HASH for groupings, but will >> never >> >> >> know. >> >> >> > >> >> >> >> >> >> The condition is pretty simple - if the query has grouping sets, >> look at >> >> >> grouping sets, otherwise look at groupClause. For this to be an issue >> >> >> the query would need to have both grouping sets and (independent) >> group >> >> >> clause - which AFAIK is not possible. >> >> >> >> >> >Uh? >> >> >(parse->groupClause != NIL) If different from NIL we have >> ((independent) >> >> >group clause), grouping_is_hashable should check? >> >> >(gd ? gd->any_hashable : grouping_is_hashable(parse->groupClause >> >> >If gd is not NIL and gd->any_hashtable is FALSE? >> >> > >> >> >> >> Sorry, I'm not quite sure I understand what this is meant to say :-( >> >> >> >> Anyway, (groupClause != NIL) does not mean the groupClause is somehow >> >> independent (of what?). Add some debugging to create_grouping_paths and >> >> you'll see that e.g. this query ends up with groupClause with 3 items: >> >> >> >> select 1 from t group by grouping sets ((a), (b), (c)); >> >> >> >> and so does this one: >> >> >> >> select 1 from t group by grouping sets ((a,c), (a,b)); >> >> >> >> I'm no expert in grouping sets, but I'd bet this means we transform the >> >> grouping sets into a groupClause by extracting the keys. I haven't >> >> investigated why exactly we do this, but I'm sure there's a reason (e.g. >> >> it gives us SortGroupClause). >> >> >> >> You seem to believe a query can have both grouping sets and regular >> >> grouping at the same level - but how would such query look like? Surely >> >> you can't have two GROuP BY clauses. You can do >> >> >> >Translating into code: >> >gd is grouping sets and >> >parse->groupClause is regular grouping >> >and we cannot have both at the same time. >> > >> >> Have you verified the claim that we can't have both at the same time? As >> I already explained, we build groupClause from grouping sets. I suggest >> you do some debugging using the queries I used as examples, and you'll >> see the claim is not true. >> >I think we already agreed that we cannot have both at the same time. > No, we haven't. I think we've agreed that you can't have both a group by clause with grouping sets and then another group by clause with "plain" grouping. But even with grouping sets you'll have groupClause generated from the grouping sets. See preprocess_grouping_sets. 
> >> >> > >> >> select 1 from t group by a, grouping sets ((b), (c)); >> >> >> >> which is however mostly equivalent to (AFAICS) >> >> >> >> select 1 from t group by grouping sets ((a,b), (a,c)) >> >> >> >> so it's not like we have an independent groupClause either I think. >> >> >> >> The whole point of the original condition is - if there are grouping >> >> sets, check if at least one can be executed using hashing (i.e. all keys >> >> are hashable). Otherwise (without grouping sets) look at the grouping as >> >> a whole. >> >> >> >Or if gd is NULL check parse->groupClause. >> > >> >> Which is what I'm saying. >> >> > >> >> I don't see how your change improves this - if there are grouping sets, >> >> it's futile to look at the whole groupClause if at least one grouping >> >> set can't be hashed. >> >> >> >> But there's a simple way to disprove this - show us a case (table and a >> >> query) where your code correctly allows hashing while the current one >> >> does not. >> > >> >I'm not an expert in grouping either. >> >The question I have here is whether gd is populated and has gd-> >> >any_hashable as FALSE, >> >Its mean no use checking parse-> groupClause, it's a waste of time, Ok. >> > >> >> I'm sorry, I don't follow your logic. Those are two separate cases. If >> we have grouping sets, we have to check if at least one can be hashed. >> If we don't have grouping sets, we have to check groupClause directly. >> Why would that be a waste of time is unclear to me. >> >This is clear to me. >The problem is how it was implemented in create_grouping_paths. > You haven't demonstrated what the problem is, though. Showing us a query that fails would
Re: PATCH: Batch/pipelining support for libpq
Hi Dave, I merged PQbatchProcessQueue into PQgetResult. One first init call to PQbatchProcessQueue was also required in PQsendQueue to have PQgetResult ready to read the first batch query. Tests and documentation are updated accordingly. Matthieu Garrigues On Mon, Sep 21, 2020 at 3:39 PM Dave Cramer wrote: > > > > On Mon, 21 Sep 2020 at 09:21, Matthieu Garrigues > wrote: >> >> Matthieu Garrigues >> >> On Mon, Sep 21, 2020 at 3:09 PM Dave Cramer >> wrote: >> >> >> > There was a comment upthread a while back that people should look at the >> > comments made in >> > https://www.postgresql.org/message-id/20180322.211148.187821341.horiguchi.kyotaro%40lab.ntt.co.jp >> > by Horiguchi-San. >> > >> > From what I can tell this has not been addressed. The one big thing is the >> > use of PQbatchProcessQueue vs just putting it in PQgetResult. >> > >> > The argument is that adding PQbatchProcessQueue is unnecessary and just >> > adds another step. Looking at this, it seems like putting this inside >> > PQgetResult would get my vote as it leaves the interface unchanged. >> > >> >> Ok. I'll merge PQbatchProcessQueue into PQgetResult. But just one >> thing: I'll keep PQgetResult returning null between the result of two >> batched query so the user >> can know which result comes from which query. > > > Fair enough. > > There may be other things in his comments that need to be addressed. That was > the big one that stuck out for me. > > Thanks for working on this! > > > Dave Cramer > www.postgres.rocks diff --git a/doc/src/sgml/libpq.sgml b/doc/src/sgml/libpq.sgml index 92556c7ce0..15d0c03c89 100644 --- a/doc/src/sgml/libpq.sgml +++ b/doc/src/sgml/libpq.sgml @@ -4829,6 +4829,465 @@ int PQflush(PGconn *conn); + + Batch mode and query pipelining + + + libpq + batch mode + + + + libpq + pipelining + + + + libpq supports queueing up queries into + a pipeline to be executed as a batch on the server. Batching queries allows + applications to avoid a client/server round-trip after each query to get + the results before issuing the next query. + + + + When to use batching + + +Much like asynchronous query mode, there is no performance disadvantage to +using batching and pipelining. It increases client application complexity +and extra caution is required to prevent client/server deadlocks but +can sometimes offer considerable performance improvements. + + + +Batching is most useful when the server is distant, i.e. network latency +(ping time) is high, and when many small operations are being performed in +rapid sequence. There is usually less benefit in using batches when each +query takes many multiples of the client/server round-trip time to execute. +A 100-statement operation run on a server 300ms round-trip-time away would take +30 seconds in network latency alone without batching; with batching it may spend +as little as 0.3s waiting for results from the server. + + + +Use batches when your application does lots of small +INSERT, UPDATE and +DELETE operations that can't easily be transformed into +operations on sets or into a +COPY operation. + + + +Batching is not useful when information from one operation is required by the +client before it knows enough to send the next operation. The client must +introduce a synchronisation point and wait for a full client/server +round-trip to get the results it needs. However, it's often possible to +adjust the client design to exchange the required information server-side. 
+Read-modify-write cycles are especially good candidates; for example: + + BEGIN; + SELECT x FROM mytable WHERE id = 42 FOR UPDATE; + -- result: x=2 + -- client adds 1 to x: + UPDATE mytable SET x = 3 WHERE id = 42; + COMMIT; + +could be much more efficiently done with: + + UPDATE mytable SET x = x + 1 WHERE id = 42; + + + + + + The batch API was introduced in PostgreSQL 10.0, but clients using PostgresSQL 10.0 version of libpq can + use batches on server versions 7.4 and newer. Batching works on any server + that supports the v3 extended query protocol. + + + + + + + Using batch mode + + +To issue batches the application must switch +a connection into batch mode. Enter batch mode with PQenterBatchMode(conn) or test +whether batch mode is active with PQbatchStatus(conn). In batch mode only asynchronous operations are permitted, and +COPY is not recommended as it most likely will trigger failure in batch processing. +Using any synchronous command execution functions such as PQfn, +PQexec or one of its sibling functions are error conditions. +Functions allowed in batch mode are described in . + + + +The client uses libpq's asynchronous query functions to dispatch work, +marking the end of each batch with PQbatchSendQueue. +And
Re: Planner, check if can use consider HASH for groupings (src/backend/optimizer/plan/planner.c)
Em seg., 21 de set. de 2020 às 14:24, Tomas Vondra < tomas.von...@2ndquadrant.com> escreveu: > On Sun, Sep 20, 2020 at 01:50:34PM -0300, Ranier Vilela wrote: > >Em seg., 21 de set. de 2020 às 13:37, Tomas Vondra < > >tomas.von...@2ndquadrant.com> escreveu: > > > >> On Mon, Sep 21, 2020 at 08:32:35AM -0300, Ranier Vilela wrote: > >> >Em dom., 20 de set. de 2020 às 21:10, Tomas Vondra < > >> >tomas.von...@2ndquadrant.com> escreveu: > >> > > >> >> On Sun, Sep 20, 2020 at 08:09:56PM -0300, Ranier Vilela wrote: > >> >> >Em sex., 18 de set. de 2020 às 10:37, Tomas Vondra < > >> >> >tomas.von...@2ndquadrant.com> escreveu: > >> >> > > >> >> >> On Thu, Sep 17, 2020 at 06:31:12PM -0300, Ranier Vilela wrote: > >> >> >> >Em qui., 17 de set. de 2020 às 17:09, Tomas Vondra < > >> >> >> >tomas.von...@2ndquadrant.com> escreveu: > >> >> >> > > >> >> >> >> On Thu, Sep 17, 2020 at 02:12:12PM -0300, Ranier Vilela wrote: > >> >> >> >> >Hi, > >> >> >> >> > > >> >> >> >> >In case gd->any_hashable is FALSE, grouping_is_hashable is > never > >> >> >> called. > >> >> >> >> >In this case, the planner could use HASH for groupings, but > will > >> >> never > >> >> >> >> know. > >> >> >> >> > > >> >> >> >> > >> >> >> >> The condition is pretty simple - if the query has grouping > sets, > >> >> look at > >> >> >> >> grouping sets, otherwise look at groupClause. For this to be an > >> issue > >> >> >> >> the query would need to have both grouping sets and > (independent) > >> >> group > >> >> >> >> clause - which AFAIK is not possible. > >> >> >> >> > >> >> >> >Uh? > >> >> >> >(parse->groupClause != NIL) If different from NIL we have > >> >> ((independent) > >> >> >> >group clause), grouping_is_hashable should check? > >> >> >> >(gd ? gd->any_hashable : > grouping_is_hashable(parse->groupClause > >> >> >> >If gd is not NIL and gd->any_hashtable is FALSE? > >> >> >> > > >> >> >> > >> >> >> Sorry, I'm not quite sure I understand what this is meant to say > :-( > >> >> >> > >> >> >> Anyway, (groupClause != NIL) does not mean the groupClause is > somehow > >> >> >> independent (of what?). Add some debugging to > create_grouping_paths > >> and > >> >> >> you'll see that e.g. this query ends up with groupClause with 3 > >> items: > >> >> >> > >> >> >> select 1 from t group by grouping sets ((a), (b), (c)); > >> >> >> > >> >> >> and so does this one: > >> >> >> > >> >> >> select 1 from t group by grouping sets ((a,c), (a,b)); > >> >> >> > >> >> >> I'm no expert in grouping sets, but I'd bet this means we > transform > >> the > >> >> >> grouping sets into a groupClause by extracting the keys. I haven't > >> >> >> investigated why exactly we do this, but I'm sure there's a reason > >> (e.g. > >> >> >> it gives us SortGroupClause). > >> >> >> > >> >> >> You seem to believe a query can have both grouping sets and > regular > >> >> >> grouping at the same level - but how would such query look like? > >> Surely > >> >> >> you can't have two GROuP BY clauses. You can do > >> >> >> > >> >> >Translating into code: > >> >> >gd is grouping sets and > >> >> >parse->groupClause is regular grouping > >> >> >and we cannot have both at the same time. > >> >> > > >> >> > >> >> Have you verified the claim that we can't have both at the same > time? As > >> >> I already explained, we build groupClause from grouping sets. I > suggest > >> >> you do some debugging using the queries I used as examples, and > you'll > >> >> see the claim is not true. > >> >> > >> >I think we already agreed that we cannot have both at the same time. 
> >> > > >> > >> No, we haven't. I think we've agreed that you can't have both a group by > >> clause with grouping sets and then another group by clause with "plain" > >> grouping. But even with grouping sets you'll have groupClause generated > >> from the grouping sets. See preprocess_grouping_sets. > >> > >> > > >> >> > >> >> > > >> >> >> select 1 from t group by a, grouping sets ((b), (c)); > >> >> >> > >> >> >> which is however mostly equivalent to (AFAICS) > >> >> >> > >> >> >> select 1 from t group by grouping sets ((a,b), (a,c)) > >> >> >> > >> >> >> so it's not like we have an independent groupClause either I > think. > >> >> >> > >> >> >> The whole point of the original condition is - if there are > grouping > >> >> >> sets, check if at least one can be executed using hashing (i.e. > all > >> keys > >> >> >> are hashable). Otherwise (without grouping sets) look at the > >> grouping as > >> >> >> a whole. > >> >> >> > >> >> >Or if gd is NULL check parse->groupClause. > >> >> > > >> >> > >> >> Which is what I'm saying. > >> >> > >> >> > > >> >> >> I don't see how your change improves this - if there are grouping > >> sets, > >> >> >> it's futile to look at the whole groupClause if at least one > grouping > >> >> >> set can't be hashed. > >> >> >> > >> >> >> But there's a simple way to disprove this - show us a case (table > >> and a > >> >> >> query) where your code correctly allows hashing while the curre
Re: WIP: BRIN multi-range indexes
On Fri, Sep 18, 2020 at 6:27 PM Tomas Vondra wrote: > But maybe we could still use this scheme by actually computing > > h1 = hash_uint32_extended(value, seed1); > h2 = hash_uint32_extended(value, seed2); > > and then use this as the independent hash functions. I think that would > meet the requirements of the paper. Yeah, that would work algorithmically. It would be trivial to add to the patch, too of course. There'd be a higher up-front cpu cost. Also, I'm a bit cautious of rehashing hashes, and whether the two values above are independent enough. I'm not sure either of these points matters. My guess is the partition approach is more sound, but it has some minor organizational challenges (see below). > Why would 11k bits be more than we'd like to store? Assuming we could > use the whole 8kB page for the bloom filter, that'd be about 64k bits. > In practice there'd be a bit of overhead (page header ...) but it's > still much more than 11k bits. Brain fade -- I guess I thought we were avoiding being toasted, but now I see that's not possible for BRIN storage. So, we'll want to guard against this: ERROR: index row size 8160 exceeds maximum 8152 for index "bar_num_idx" While playing around with the numbers I had an epiphany: At the defaults, the filter already takes up ~4.3kB, over half the page. There is no room for another tuple, so if we're only indexing one column, we might as well take up the whole page. Here MT = max tuples per 128 8k pages, or 37120, so default ndistinct is 3712.

    n      k   m      p
    MT/10  7   35580  0.01
    MT/10  7   64000  0.0005
    MT/10  12  64000  0.00025

Assuming ndistinct isn't way off reality, we get 20x-40x lower false positive rate almost for free, and it'd be trivial to code! Keeping k at 7 would be more robust, since it's equivalent to starting with n = ~6000, p = 0.006, which is still almost 2x less false positives than you asked for. It also means nearly doubling the number of sorted values before switching. Going the other direction, capping nbits to 64k bits when ndistinct gets too high, the false positive rate we can actually support starts to drop. Here, the user requested 0.001 fpr.

    n      k   p
    4500   9   0.001
    6000   7   0.006
    7500   6   0.017
    15000  3   0.129   (probably useless by now)
    MT     1   0.440
    64000  1   0.63    (possible with > 128 pages per range)

I imagine smaller pages_per_range settings are going to be useful for skinny tables (note to self -- test). Maybe we could provide a way for the user to see that their combination of pages_per_range, false positive rate, and ndistinct is supportable, like brin_bloom_get_supported_fpr(). Or document to check with page_inspect. And that's not even considering multi-column indexes, like you mentioned. > But I guess we can simply make the table > of primes a bit larger, right? If we want to support all the above cases without falling down entirely, it would have to go up to 32k to be safe (When k = 1 we could degenerate to one modulus on the whole filter). That would be a table of about 7kB, which we could binary search. [thinks for a moment]...Actually, if we wanted to be clever, maybe we could precalculate the primes needed for the 64k bit cases and stick them at the end of the array. The usual algorithm will find them. That way, we could keep the array around 2kB. However, for >8kB block size, we couldn't adjust the 64k number, which might be okay, but not really project policy. We could also generate the primes via a sieve instead, which is really fast (and more code).
That would be more robust, but that would require the filter to store the actual primes used, so 20 more bytes at max k = 10. We could hard-code space for that, or to keep from hard-coding maximum k and thus lowest possible false positive rate, we'd need more bookkeeping. So, with the partition approach, we'd likely have to set in stone either max nbits, or min target false positive rate. The latter option seems more principled, not only for the block size, but also since the target fp rate is already fixed by the reloption, and as I've demonstrated, we can often go above and beyond the reloption even without changing k. > >One wrinkle is that the sum of k primes is not going to match m > >exactly. If the sum is too small, we can trim a few bits off of the > >filter bitmap. If the sum is too large, the last partition can spill > >into the front of the first one. This shouldn't matter much in the > >common case since we need to round m to the nearest byte anyway. > > > > AFAIK the paper simply says that as long as the sum of partitions is > close to the requested nbits, it's good enough. So I guess we could just > roll with that, no need to trim/wrap or something like that. Hmm, I'm not sure I understand you. I can see not caring to trim wasted bits, but we can't set/read off the end of the filter. If we don't wrap, we could just skip reading/writing that bit. So a tiny portion of access would be at k - 1. The paper is not clear
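For reference, the figures in the tables above follow the standard Bloom filter formulas and are not specific to the patch; they can be reproduced in psql, for example:

    -- optimal filter size and hash count for ndistinct = 3712, target fpr = 0.01
    SELECT ceil(-(3712 * ln(0.01)) / (ln(2) ^ 2)) AS m_bits,    -- 35580
           round(35580.0 / 3712 * ln(2))          AS k_hashes;  -- 7

    -- achieved false-positive rate for n = 4500, k = 9, m = 64000 bits
    SELECT power(1 - exp(-(9 * 4500.0) / 64000), 9) AS p;       -- ~0.001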
Re: pgindent vs dtrace on macos
I wrote: > We still have to deal with > src/backend/utils/sort/qsort_tuple.c > src/pl/plpgsql/src/plerrcodes.h > src/pl/plpython/spiexceptions.h > src/pl/tcl/pltclerrcodes.h > if we want to be entirely clean about this. I took care of those, so I think we're done here. regards, tom lane
Re: recovering from "found xmin ... from before relfrozenxid ..."
On Sun, Sep 20, 2020 at 1:13 PM Tom Lane wrote: > 1. My patch a7212be8b does indeed have a problem. It will allow > vacuum_set_xid_limits to compute freezeLimit = nextXid for a temp > table if freeze_min_age is zero (ie VACUUM FREEZE). If there's > any concurrent transactions, this falls foul of > heap_prepare_freeze_tuple's expectation that > > * NB: cutoff_xid *must* be <= the current global xmin, to ensure that any > * XID older than it could neither be running nor seen as running by any > * open transaction. This ensures that the replacement will not change > * anyone's idea of the tuple state. > > The "cannot freeze committed xmax" error message appears to be banking on > the assumption that we'd not reach heap_prepare_freeze_tuple for any > committed-dead tuple unless its xmax is past the specified cutoff_xid. > > 2. As Amit suspected, there's an inconsistency between pruneheap.c's > rules for which tuples are removable and vacuum.c's rules for that. > This seems like a massive bug in its own right: what's the point of > pruneheap.c going to huge effort to decide whether it should keep a > tuple if vacuum will then kill it anyway? I do not understand why > whoever put in the GlobalVisState stuff only applied it in pruneheap.c > and not VACUUM proper. I am not sure I fully understand why you're contrasting pruneheap.c with vacuum here, because vacuum just does HOT pruning to remove dead tuples - maybe calling the relevant functions with different arguments, but it doesn't have its own independent logic for that. The key point is that the freezing code isn't, or at least historically wasn't, very smart about dead tuples. For example, I think if you told it to freeze something that was dead it would just do it, which is obviously bad. And that's why Andres stuck those sanity checks in there. But it's still pretty fragile. I think perhaps the pruning code should be rewritten in such a way that it can be combined with the code that freezes and marks pages all-visible, so that there's not so much action at a distance, but such an endeavor is in itself pretty scary, and certainly not back-patchable. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: Improper use about DatumGetInt32
On Sun, Sep 20, 2020 at 9:17 PM Hou, Zhijie wrote: > In (/contrib/bloom/blutils.c:277), I found it use DatumGetInt32 to get UInt32 > type. > Is it more appropriate to use DatumGetUInt32 here? Typically, the DatumGetBlah() function that you pick should match the SQL data type that the function is returning. So if the function returns pg_catalog.int4, which corresponds to the C data type int32, you would use DatumGetInt32. There is no SQL type corresponding to the C data type uint32, so I'm not sure why we even have DatumGetUInt32. I'm sort of suspicious that there's some fuzzy thinking going on there. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: pgindent vs dtrace on macos
> On 21 Sep 2020, at 19:59, Tom Lane wrote: > > I wrote: >> We still have to deal with >> src/backend/utils/sort/qsort_tuple.c >> src/pl/plpgsql/src/plerrcodes.h >> src/pl/plpython/spiexceptions.h >> src/pl/tcl/pltclerrcodes.h >> if we want to be entirely clean about this. > > I took care of those, so I think we're done here. Aha, thanks! They were still on my TODO but I got distracted by a shiny object or two. Thanks for fixing. cheers ./daniel
Re: OpenSSL 3.0.0 compatibility
> On 19 Sep 2020, at 04:11, Michael Paquier wrote: > On Fri, Sep 18, 2020 at 04:11:13PM +0200, Daniel Gustafsson wrote: >> The other problem was that the cipher context >> padding must be explicitly set, whereas in previous versions relying on the >> default worked fine. EVP_CIPHER_CTX_set_padding always returns 1 so thats >> why >> it isn't checking the returnvalue as the other nearby initialization calls. > > It seems to me that it would be a good idea to still check for the > return value of EVP_CIPHER_CTX_set_padding() and just return with > a PXE_CIPHER_INIT. By looking at the upstream code, it is true that > it always returns true for <= 1.1.1, but that's not the case for > 3.0.0. Some code paths of upstream also check after it. Thats a good point, it's now provider dependent and can thus fail in case the provider isn't supplying the functionality. We've already loaded a provider where we know the call is supported, but it's of course better to check. Fixed in the attached v2. I was only reading the docs and not the code, and they haven't been updated to reflect this. I'll open a PR to the OpenSSL devs to fix that. > Also, what's the impact with disabling the padding for <= 1.1.1? This > part of the upstream code is still a bit obscure to me.. I'm far from fluent in OpenSSL, but AFAICT it has to do with the new provider API. The default value for padding is unchanged, but it needs to be propaged into the provider to be set in the context where the old code picked it up automatically. The relevant OpenSSL commit (83f68df32f0067ee7b0) which changes the docs to say the function should be called doesn't contain enough information to explain why. >> To avoid problems with the by LibreSSL overloaded OPENSSL_VERSION_NUMBER >> macro >> (which too is deprecated in 3.0.0), I used the new macro which is only set in >> 3.0.0. Not sure if that's considered acceptable or if we should invent our >> own >> version macro in autoconf. > > OSSL_PROVIDER_load() is new as of 3.0.0, so using a configure switch > similarly as what we do for the other functions should be more > consistent and enough, no? Good point, fixed in the attached. >> For the main SSL tests, the incorrect password test has a new errormessage >> which is added in 0002. > > Hmm. I am linking to a build of alpha6 here, but I still see the > error being reported as a bad decrypt for this test. Interesting. Turns out it was coming from when I tested against OpenSSL git HEAD, so it may come in alpha7 (or whatever the next version will be). Let's disregard this for now until dust settles, I've dropped the patch from the series. cheers ./daniel 0001-Fix-OpenSSL-3-support-in-pgcrypto-v2.patch Description: Binary data
Re: recovering from "found xmin ... from before relfrozenxid ..."
Robert Haas writes: > On Sun, Sep 20, 2020 at 1:13 PM Tom Lane wrote: >> 2. As Amit suspected, there's an inconsistency between pruneheap.c's >> rules for which tuples are removable and vacuum.c's rules for that. >> This seems like a massive bug in its own right: what's the point of >> pruneheap.c going to huge effort to decide whether it should keep a >> tuple if vacuum will then kill it anyway? I do not understand why >> whoever put in the GlobalVisState stuff only applied it in pruneheap.c >> and not VACUUM proper. > I am not sure I fully understand why you're contrasting pruneheap.c > with vacuum here, because vacuum just does HOT pruning to remove dead > tuples - maybe calling the relevant functions with different > arguments, but it doesn't have its own independent logic for that. Right, but what we end up with is that the very same tuple xmin and xmax might result in pruning/deletion, or not, depending on whether it's part of a HOT chain or not. That's at best pretty weird, and at worst it means that corner-case bugs in other places are triggered in only one of the two scenarios ... which is what we have here. > The key point is that the freezing code isn't, or at least > historically wasn't, very smart about dead tuples. For example, I > think if you told it to freeze something that was dead it would just > do it, which is obviously bad. And that's why Andres stuck those > sanity checks in there. But it's still pretty fragile. I think perhaps > the pruning code should be rewritten in such a way that it can be > combined with the code that freezes and marks pages all-visible, so > that there's not so much action at a distance, but such an endeavor is > in itself pretty scary, and certainly not back-patchable. Not sure. The pruning code is trying to serve two masters, that is both VACUUM and on-the-fly cleanup during ordinary queries. If you try to merge it with other tasks that VACUUM does, you're going to have a mess for the second usage. I fear there's going to be pretty strong conservation of cruft either way. FWIW, weakening the sanity checks in heap_prepare_freeze_tuple is *not* my preferred fix here. But it'll take some work in other places to preserve them. regards, tom lane
Re: Improper use about DatumGetInt32
Robert Haas writes: > Typically, the DatumGetBlah() function that you pick should match the > SQL data type that the function is returning. So if the function > returns pg_catalog.int4, which corresponds to the C data type int32, > you would use DatumGetInt32. There is no SQL type corresponding to the > C data type uint32, so I'm not sure why we even have DatumGetUInt32. xid? regards, tom lane
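xid is indeed one of the few places where the full unsigned 32-bit range is exposed at the SQL level, unlike int4; a trivial illustration from psql, unrelated to the bloom code that started the thread:

    SELECT 4294967295::int4;    -- ERROR:  integer out of range
    SELECT '4294967295'::xid;   -- fine: xid covers the whole uint32 range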
Re: pgindent vs dtrace on macos
Oh wait, I forgot about the fmgrprotos.h discrepancy. I wrote: > It strikes me that a low-cost workaround would be to rename these > C functions. There's no law that their C names must match the > SQL names. Here's a proposed patch to fix it that way. regards, tom lane diff --git a/src/backend/utils/adt/numeric.c b/src/backend/utils/adt/numeric.c index 27d65557df..2cc2da60eb 100644 --- a/src/backend/utils/adt/numeric.c +++ b/src/backend/utils/adt/numeric.c @@ -1109,7 +1109,7 @@ numeric_support(PG_FUNCTION_ARGS) * scale of the attribute have to be applied on the value. */ Datum -numeric (PG_FUNCTION_ARGS) +numeric_apply_typmod(PG_FUNCTION_ARGS) { Numeric num = PG_GETARG_NUMERIC(0); int32 typmod = PG_GETARG_INT32(1); diff --git a/src/backend/utils/adt/oracle_compat.c b/src/backend/utils/adt/oracle_compat.c index 76e666474e..79ba28671f 100644 --- a/src/backend/utils/adt/oracle_compat.c +++ b/src/backend/utils/adt/oracle_compat.c @@ -923,7 +923,7 @@ ascii(PG_FUNCTION_ARGS) / Datum -chr (PG_FUNCTION_ARGS) +chr_code_to_text(PG_FUNCTION_ARGS) { uint32 cvalue = PG_GETARG_UINT32(0); text *result; diff --git a/src/include/catalog/pg_proc.dat b/src/include/catalog/pg_proc.dat index f48f5fb4d9..5eb293ee8f 100644 --- a/src/include/catalog/pg_proc.dat +++ b/src/include/catalog/pg_proc.dat @@ -3399,7 +3399,7 @@ prosrc => 'ascii' }, { oid => '1621', descr => 'convert int4 to char', proname => 'chr', prorettype => 'text', proargtypes => 'int4', - prosrc => 'chr' }, + prosrc => 'chr_code_to_text' }, { oid => '1622', descr => 'replicate string n times', proname => 'repeat', prorettype => 'text', proargtypes => 'text int4', prosrc => 'repeat' }, @@ -4210,7 +4210,8 @@ proargtypes => 'internal', prosrc => 'numeric_support' }, { oid => '1703', descr => 'adjust numeric to typmod precision/scale', proname => 'numeric', prosupport => 'numeric_support', - prorettype => 'numeric', proargtypes => 'numeric int4', prosrc => 'numeric' }, + prorettype => 'numeric', proargtypes => 'numeric int4', + prosrc => 'numeric_apply_typmod' }, { oid => '1704', proname => 'numeric_abs', prorettype => 'numeric', proargtypes => 'numeric', prosrc => 'numeric_abs' },
Re: pgindent vs dtrace on macos
On 2020-Sep-21, Tom Lane wrote: > Oh wait, I forgot about the fmgrprotos.h discrepancy. > > I wrote: > > It strikes me that a low-cost workaround would be to rename these > > C functions. There's no law that their C names must match the > > SQL names. > > Here's a proposed patch to fix it that way. pgtypes_numeric.h still contains

    typedef struct
    {
        int             ndigits;    /* number of digits in digits[] - can be 0! */
        int             weight;     /* weight of first digit */
        int             rscale;     /* result scale */
        int             dscale;     /* display scale */
        int             sign;       /* NUMERIC_POS, NUMERIC_NEG, or NUMERIC_NAN */
        NumericDigit   *buf;        /* start of alloc'd space for digits[] */
        NumericDigit   *digits;     /* decimal digits */
    } numeric;

... isn't this more likely to create a typedef entry than merely a function name? -- Álvaro Herrera https://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Re: pgindent vs dtrace on macos
Alvaro Herrera writes: > On 2020-Sep-21, Tom Lane wrote: >> Here's a proposed patch to fix it that way. > pgtypes_numeric.h still contains > typedef struct > { > } numeric; > ... isn't this more likely to create a typedef entry than merely a > function name? Well, yeah, it *is* a typedef. My proposal is to rename the C function to avoid the conflict, rather than renaming the typedef. Given the small number of direct calls (none), that's a lot less work. Also, I think pgtypes_numeric.h is exposed to ecpg client code, so changing that typedef's name could be quite problematic. regards, tom lane
Re: pgindent vs dtrace on macos
On 2020-Sep-21, Tom Lane wrote: > > ... isn't this more likely to create a typedef entry than merely a > > function name? > > Well, yeah, it *is* a typedef. My proposal is to rename the C function > to avoid the conflict, rather than renaming the typedef. Given the > small number of direct calls (none), that's a lot less work. Also, > I think pgtypes_numeric.h is exposed to ecpg client code, so changing > that typedef's name could be quite problematic. Ah, of course. The idea of adding the names to pgindent's %blacklist results in severe uglification, particularly in the regex code, so +1 for your workaround. -- Álvaro Herrera https://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Re: Improper use about DatumGetInt32
Hi, On 2020-09-21 14:08:22 -0400, Robert Haas wrote: > There is no SQL type corresponding to the C data type uint32, so I'm > not sure why we even have DatumGetUInt32. I'm sort of suspicious that > there's some fuzzy thinking going on there. I think we mostly use it for the few places where we currently expose data as a signed integer on the SQL level, but internally actually treat it as a unsigned data. There's not a lot of those, but there e.g. is pg_class.relpages. There also may be places where we use it for functions that can be created but not called from SQL (using the INTERNAL type). - Andres
Re: WIP: BRIN multi-range indexes
On Mon, Sep 21, 2020 at 01:42:34PM -0400, John Naylor wrote: On Fri, Sep 18, 2020 at 6:27 PM Tomas Vondra wrote: But maybe we could still use this scheme by actually computing h1 = hash_uint32_extended(value, seed1); h2 = hash_uint32_extended(value, seed2); and then use this as the independent hash functions. I think that would meet the requirements of the paper. Yeah, that would work algorithmically. It would be trivial to add to the patch, too of course. There'd be a higher up-front cpu cost. Also, I'm a bit cautious of rehashing hashes, and whether the two values above are independent enough. I'm not sure either of these points matters. My guess is the partition approach is more sound, but it has some minor organizational challenges (see below). OK. I don't think rehashing hashes is an issue as long as the original hash has sufficiently low collision rate (and while we know it's not perfect we know it works well enough for hash indexes etc.). And I doubt the cost of the extra hash of uint32 would be noticeable. That being said the partitioning approach might be more sound and it's definitely worth giving it a try. Why would 11k bits be more than we'd like to store? Assuming we could use the whole 8kB page for the bloom filter, that'd be about 64k bits. In practice there'd be a bit of overhead (page header ...) but it's still much more than 11k bits. Brain fade -- I guess I thought we were avoiding being toasted, but now I see that's not possible for BRIN storage. So, we'll want to guard against this: ERROR: index row size 8160 exceeds maximum 8152 for index "bar_num_idx" While playing around with the numbers I had an epiphany: At the defaults, the filter already takes up ~4.3kB, over half the page. There is no room for another tuple, so if we're only indexing one column, we might as well take up the whole page. Hmm, yeah. I may be wrong but IIRC indexes don't support external storage but compression is still allowed. So even if those defaults are a bit higher than needed that should make the bloom filters a bit more compressible, and thus fit multiple BRIN tuples on a single page. Here MT = max tuples per 128 8k pages, or 37120, so default ndistinct is 3712. n k m p MT/10 7 35580 0.01 MT/10 7 64000 0.0005 MT/10 12 64000 0.00025 Assuming ndistinct isn't way off reality, we get 20x-40x lower false positive rate almost for free, and it'd be trivial to code! Keeping k at 7 would be more robust, since it's equivalent to starting with n = ~6000, p = 0.006, which is still almost 2x less false positives than you asked for. It also means nearly doubling the number of sorted values before switching. Going the other direction, capping nbits to 64k bits when ndistinct gets too high, the false positive rate we can actually support starts to drop. Here, the user requested 0.001 fpr. n k p 45009 0.001 60007 0.006 75006 0.017 15000 3 0.129 (probably useless by now) MT 1 0.440 64000 1 0.63 (possible with > 128 pages per range) I imagine smaller pages_per_range settings are going to be useful for skinny tables (note to self -- test). Maybe we could provide a way for the user to see that their combination of pages_per_range, false positive rate, and ndistinct is supportable, like brin_bloom_get_supported_fpr(). Or document to check with page_inspect. And that's not even considering multi-column indexes, like you mentioned. I agree giving users visibility into this would be useful. Not sure about how much we want to rely on these optimizations, though, considering multi-column indexes kinda break this. 
But I guess we can simply make the table of primes a bit larger, right? If we want to support all the above cases without falling down entirely, it would have to go up to 32k to be safe (When k = 1 we could degenerate to one modulus on the whole filter). That would be a table of about 7kB, which we could binary search. [thinks for a moment]...Actually, if we wanted to be clever, maybe we could precalculate the primes needed for the 64k bit cases and stick them at the end of the array. The usual algorithm will find them. That way, we could keep the array around 2kB. However, for >8kB block size, we couldn't adjust the 64k number, which might be okay, but not really project policy. We could also generate the primes via a sieve instead, which is really fast (and more code). That would be more robust, but that would require the filter to store the actual primes used, so 20 more bytes at max k = 10. We could hard-code space for that, or to keep from hard-coding maximum k and thus lowest possible false positive rate, we'd need more bookkeeping. I don't think the efficiency of this code matters too much - it's only used once when creating the index, so the simpler the better. Certainly for now, while testing the partitioning approach. So, with the partition approach, we'd likely have to set in stone either max nbits, or min targ
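For readers following the two-seed idea at the top of this exchange, here is a minimal sketch of how the two hash_uint32_extended() calls could drive a classic double-hashing bloom filter. The seed constants, the bloom_add_value() name, and the word-array filter layout are illustrative assumptions, not code from the patch:

    #include "postgres.h"
    #include "common/hashfn.h"

    /* illustrative seeds; the patch would pick its own */
    #define BLOOM_SEED_1 UINT64CONST(0x71d924af)
    #define BLOOM_SEED_2 UINT64CONST(0xba48b314)

    /*
     * Set the k bits for "value" in a filter of "nbits" bits, stored as an
     * array of uint64 words.  Two independently seeded hashes are combined
     * with the standard double-hashing construction g_i(x) = h1 + i*h2 (mod m).
     */
    static void
    bloom_add_value(uint64 *filter, int nbits, int nhashes, uint32 value)
    {
        uint64      h1 = hash_uint32_extended(value, BLOOM_SEED_1);
        uint64      h2 = hash_uint32_extended(value, BLOOM_SEED_2);

        for (int i = 0; i < nhashes; i++)
        {
            uint64      bit = (h1 + (uint64) i * h2) % (uint64) nbits;

            filter[bit / 64] |= UINT64CONST(1) << (bit % 64);
        }
    }

A membership probe would recompute the same k positions and report a (possible) hit only if every one of those bits is set.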
Re: Feature improvement for pg_stat_statements
Hi,

On 2020-09-18 17:55:12 +0900, btnakamichin wrote:
> I’m thinking of adding a function called
> pg_stat_statements_reset_time() that returns the timestamp of the last
> pg_stat_statements_reset(). pg_stat_statements can reset each SQL
> statement. We could record a reset timestamp for each SQL statement, but I don’t feel
> the need. This time, I’m thinking of updating the reset timestamp only when
> all stats were reset.

What exactly do you mean by "can reset each SQL statement"? I don't really see what alternative you're analyzing?

It does make sense to me to have a function returning the time of the last reset.

Greetings,

Andres Freund
Re: recovering from "found xmin ... from before relfrozenxid ..."
On Mon, Sep 21, 2020 at 2:21 PM Tom Lane wrote:
> Right, but what we end up with is that the very same tuple xmin and
> xmax might result in pruning/deletion, or not, depending on whether
> it's part of a HOT chain or not. That's at best pretty weird, and
> at worst it means that corner-case bugs in other places are triggered
> in only one of the two scenarios ... which is what we have here.

I'm not sure I really understand how that's happening, because surely HOT chains and non-HOT chains are pruned by the same code, but it doesn't sound good.

> FWIW, weakening the sanity checks in heap_prepare_freeze_tuple is
> *not* my preferred fix here. But it'll take some work in other
> places to preserve them.

Makes sense.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Re: recovering from "found xmin ... from before relfrozenxid ..."
Hi, On 2020-09-21 16:02:29 -0400, Robert Haas wrote: > On Mon, Sep 21, 2020 at 2:21 PM Tom Lane wrote: > > Right, but what we end up with is that the very same tuple xmin and > > xmax might result in pruning/deletion, or not, depending on whether > > it's part of a HOT chain or not. That's at best pretty weird, and > > at worst it means that corner-case bugs in other places are triggered > > in only one of the two scenarios ... which is what we have here. > > I'm not sure I really understand how that's happening, because surely > HOT chains and non-HOT chains are pruned by the same code, but it > doesn't sound good. Not necessarily, unfortunately: case HEAPTUPLE_DEAD: /* * Ordinarily, DEAD tuples would have been removed by * heap_page_prune(), but it's possible that the tuple * state changed since heap_page_prune() looked. In * particular an INSERT_IN_PROGRESS tuple could have * changed to DEAD if the inserter aborted. So this * cannot be considered an error condition. * * If the tuple is HOT-updated then it must only be * removed by a prune operation; so we keep it just as if * it were RECENTLY_DEAD. Also, if it's a heap-only * tuple, we choose to keep it, because it'll be a lot * cheaper to get rid of it in the next pruning pass than * to treat it like an indexed tuple. Finally, if index * cleanup is disabled, the second heap pass will not * execute, and the tuple will not get removed, so we must * treat it like any other dead tuple that we choose to * keep. * * If this were to happen for a tuple that actually needed * to be deleted, we'd be in trouble, because it'd * possibly leave a tuple below the relation's xmin * horizon alive. heap_prepare_freeze_tuple() is prepared * to detect that case and abort the transaction, * preventing corruption. */ if (HeapTupleIsHotUpdated(&tuple) || HeapTupleIsHeapOnly(&tuple) || params->index_cleanup == VACOPT_TERNARY_DISABLED) nkeep += 1; else tupgone = true; /* we can delete the tuple */ all_visible = false; So if e.g. a transaction aborts between the heap_page_prune and this check the pruning behaviour depends on whether the tuple is part of a HOT chain or not. Greetings, Andres Freund
Re: recovering from "found xmin ... from before relfrozenxid ..."
Robert Haas writes: > On Mon, Sep 21, 2020 at 2:21 PM Tom Lane wrote: >> Right, but what we end up with is that the very same tuple xmin and >> xmax might result in pruning/deletion, or not, depending on whether >> it's part of a HOT chain or not. That's at best pretty weird, and >> at worst it means that corner-case bugs in other places are triggered >> in only one of the two scenarios ... which is what we have here. > I'm not sure I really understand how that's happening, because surely > HOT chains and non-HOT chains are pruned by the same code, but it > doesn't sound good. No, they're not. lazy_scan_heap() will never remove a tuple that is HeapTupleIsHotUpdated or HeapTupleIsHeapOnly, even if it thinks it's DEAD -- cf. vacuumlazy.c, about line 1350. So tuples in a HOT chain are deleted exactly when pruneheap.c sees fit to do so. OTOH, for tuples not in a HOT chain, the decision is (I believe) entirely on lazy_scan_heap(). And the core of my complaint is that pruneheap.c's decisions about what is DEAD are not reliably identical to what HeapTupleSatisfiesVacuum thinks. I don't mind if a free-standing prune operation has its own rules, but when it's invoked by VACUUM it ought to follow VACUUM's rules about what is dead or alive. What remains unclear at this point is whether we ought to import some of the intelligence added by the GlobalVisState patch into VACUUM's behavior, or just dumb down pruneheap.c so that it applies exactly the HeapTupleSatisfiesVacuum rules when invoked by VACUUM. regards, tom lane
Re: recovering from "found xmin ... from before relfrozenxid ..."
Hi, On 2020-09-20 13:13:16 -0400, Tom Lane wrote: > 2. As Amit suspected, there's an inconsistency between pruneheap.c's > rules for which tuples are removable and vacuum.c's rules for that. > This seems like a massive bug in its own right: what's the point of > pruneheap.c going to huge effort to decide whether it should keep a > tuple if vacuum will then kill it anyway? I do not understand why > whoever put in the GlobalVisState stuff only applied it in pruneheap.c > and not VACUUM proper. The reason for that is that the GlobalVisState stuff is computed heuristically (and then re-checked if that's not sufficient to prune a tuple, unless already done so). That's done so GetSnapshotData() doesn't have to look at each backends ->xmin, which is quite a massive speedup at higher connection counts, as each backends ->xmin changes much more often than each backend's xid. But for VACUUM we need to do the accurate scan of the procarray anyway, because we need an accurate value for things other than HOT pruning decisions. What do you exactly mean with the "going to huge effort to decide" bit? > I think to move forward, we need to figure out what the freezing > behavior ought to be for temp tables. We could make it the same > as it was before a7212be8b, which'd just require some more complexity > in vacuum_set_xid_limits. However, that negates the idea that we'd > like VACUUM's behavior on a temp table to be fully independent of > whether concurrent transactions exist. I'd prefer to allow a7212be8b's > behavior to stand, but then it seems we need to lobotomize the error > check in heap_prepare_freeze_tuple to some extent. I think that's an argument for what I suggested elsewhere, which is that we should move the logic for a different horizon for temp tables out of vacuum_set_xid_limits, and into procarray. > Independently of that, it seems like we need to fix things so that > when pruneheap.c is called by vacuum, it makes EXACTLY the same > dead-or-not-dead decisions that the main vacuum code makes. This > business with applying some GlobalVisState rule or other instead > seems just as unsafe as can be. It's not great, I agree. Not sure there is a super nice answer though. Note that, even before my changes, vacuumlazy can decide differently than pruning whether a tuple is live. E.g. when an inserting transaction aborts. That's pretty much unavoidable as long as we have multiple HTSV calls for a tuple, since none of our locking can (nor should) prevent concurrent transactions from aborting. Before your new code avoiding the GetOldestNonRemovableTransactionId() call for temp tables, the GlobalVis* can never be more pessimistic than decisions based ona prior GetOldestNonRemovableTransactionId call (as that internally updates the heuristic horizons used by GlobalVis). Greetings, Andres Freund
Re: recovering from "found xmin ... from before relfrozenxid ..."
Andres Freund writes: > The reason for that is that the GlobalVisState stuff is computed > heuristically (and then re-checked if that's not sufficient to prune a > tuple, unless already done so). That's done so GetSnapshotData() doesn't > have to look at each backends ->xmin, which is quite a massive speedup > at higher connection counts, as each backends ->xmin changes much more > often than each backend's xid. OK. > What do you exactly mean with the "going to huge effort to decide" bit? I'd looked at all the complexity around GlobalVisState, but failed to register that it should be pretty cheap on a per-tuple basis. So never mind that complaint. The point that remains is just that it's different from HeapTupleSatisfiesVacuum's rules. >> I think to move forward, we need to figure out what the freezing >> behavior ought to be for temp tables. We could make it the same >> as it was before a7212be8b, which'd just require some more complexity >> in vacuum_set_xid_limits. However, that negates the idea that we'd >> like VACUUM's behavior on a temp table to be fully independent of >> whether concurrent transactions exist. I'd prefer to allow a7212be8b's >> behavior to stand, but then it seems we need to lobotomize the error >> check in heap_prepare_freeze_tuple to some extent. > I think that's an argument for what I suggested elsewhere, which is that > we should move the logic for a different horizon for temp tables out of > vacuum_set_xid_limits, and into procarray. But procarray does not seem like a great place for table-persistence-dependent decisions either? >> Independently of that, it seems like we need to fix things so that >> when pruneheap.c is called by vacuum, it makes EXACTLY the same >> dead-or-not-dead decisions that the main vacuum code makes. This >> business with applying some GlobalVisState rule or other instead >> seems just as unsafe as can be. > It's not great, I agree. Not sure there is a super nice answer > though. Note that, even before my changes, vacuumlazy can decide > differently than pruning whether a tuple is live. E.g. when an inserting > transaction aborts. That's pretty much unavoidable as long as we have > multiple HTSV calls for a tuple, since none of our locking can (nor > should) prevent concurrent transactions from aborting. It's clear that if the environment changes between test A and test B, we might get different results. What I'm not happy about is that the rules are different, so we might get different results even if the environment did not change. regards, tom lane
Re: Parallel Full Hash Join
On Wed, Sep 11, 2019 at 11:23 PM Thomas Munro wrote: > > While thinking about looping hash joins (an alternative strategy for > limiting hash join memory usage currently being investigated by > Melanie Plageman in a nearby thread[1]), the topic of parallel query > deadlock hazards came back to haunt me. I wanted to illustrate the > problems I'm aware of with the concrete code where I ran into this > stuff, so here is a new-but-still-broken implementation of $SUBJECT. > This was removed from the original PHJ submission when I got stuck and > ran out of time in the release cycle for 11. Since the original > discussion is buried in long threads and some of it was also a bit > confused, here's a fresh description of the problems as I see them. > Hopefully these thoughts might help Melanie's project move forward, > because it's closely related, but I didn't want to dump another patch > into that other thread. Hence this new thread. > > I haven't succeeded in actually observing a deadlock with the attached > patch (though I did last year, very rarely), but I also haven't tried > very hard. The patch seems to produce the right answers and is pretty > scalable, so it's really frustrating not to be able to get it over the > line. > > Tuple queue deadlock hazard: > > If the leader process is executing the subplan itself and waiting for > all processes to arrive in ExecParallelHashEndProbe() (in this patch) > while another process has filled up its tuple queue and is waiting for > the leader to read some tuples an unblock it, they will deadlock > forever. That can't happen in the the committed version of PHJ, > because it never waits for barriers after it has begun emitting > tuples. > > Some possible ways to fix this: > > 1. You could probably make it so that the PHJ_BATCH_SCAN_INNER phase > in this patch (the scan for unmatched tuples) is executed by only one > process, using the "detach-and-see-if-you-were-last" trick. Melanie > proposed that for an equivalent problem in the looping hash join. I > think it probably works, but it gives up a lot of parallelism and thus > won't scale as nicely as the attached patch. > I have attached a patch which implements this (v1-0001-Parallel-FOJ-ROJ-single-worker-scan-buckets.patch). For starters, in order to support parallel FOJ and ROJ, I re-enabled setting the match bit for the tuples in the hashtable which 3e4818e9dd5be294d97c disabled. I did so using the code suggested in [1], reading the match bit to see if it is already set before setting it. Then, workers except for the last worker detach after exhausting the outer side of a batch, leaving one worker to proceed to HJ_FILL_INNER and do the scan of the hash table and emit unmatched inner tuples. I have also attached a variant on this patch which I am proposing to replace it (v1-0001-Parallel-FOJ-ROJ-single-worker-scan-chunks.patch) which has a new ExecParallelScanHashTableForUnmatched() in which the single worker doing the unmatched scan scans one HashMemoryChunk at a time and then frees them as it goes. I thought this might perform better than the version which uses the buckets because 1) it should do a bit less pointer chasing and 2) it frees each chunk of the hash table as it scans it which (maybe) would save a bit of time during ExecHashTableDetachBatch() when it goes through and frees the hash table, but, my preliminary tests showed a negligible difference between this and the version using buckets. I will do a bit more testing, though. 
I tried a few other variants of these patches, including one in which the workers detach from the batch inside of the batch loading and probing phase machine, ExecParallelHashJoinNewBatch(). This meant that all workers transition to HJ_FILL_INNER and then HJ_NEED_NEW_BATCH in order to detach in the batch phase machine. This, however, involved adding a lot of new variables to distinguish whether or or not the unmatched outer scan was already done, whether or not the current worker was the worker elected to do the scan, etc. Overall, it is probably incorrect to use the HJ_NEED_NEW_BATCH state in this way. I had originally tried this to avoid operating on the batch_barrier in the main hash join state machine. I've found that the more different places we add code attaching and detaching to the batch_barrier (and other PHJ barriers, for that matter), the harder it is to debug the code, however, I think in this case it is required. > 2. You could probably make it so that only the leader process drops > out of executing the inner unmatched scan, and then I think you > wouldn't have this very specific problem at the cost of losing some > (but not all) parallelism (ie the leader), but there might be other > variants of the problem. For example, a GatherMerge leader process > might be blocked waiting for the next tuple for a tuple from P1, while > P2 is try to write to a full queue, and P1 waits for P2. > > 3. You could introduce some kind of overflow for tuple queues, so >
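To make the control flow concrete, here is a rough sketch (not the attached patch, and glossing over exactly when the shared hash table memory is released) of the "detach-and-see-if-you-were-last" idea using the existing Barrier API from storage/barrier.h; hjstate, batch->batch_barrier and ExecParallelScanHashTableForUnmatched() are names assumed from the discussion:

    /*
     * This worker has exhausted the outer side of the current batch.
     * BarrierArriveAndDetach() reports whether we were the last
     * participant still attached to the batch barrier.
     */
    if (!BarrierArriveAndDetach(&batch->batch_barrier))
    {
        /*
         * Other workers are still attached; whichever of them detaches
         * last will run the unmatched-inner scan.  Move on to another
         * batch (or finish).
         */
        return NULL;
    }

    /*
     * We were the last worker on this batch, so we alone walk the hash
     * table (bucket by bucket, or chunk by chunk as in the second
     * variant) and emit inner tuples whose match bit was never set.
     */
    return ExecParallelScanHashTableForUnmatched(hjstate, econtext);

The key property for the deadlock discussion is that nobody waits at this barrier: workers either detach immediately or discover they are the sole survivor, so no process can block on a peer that is itself blocked on a full tuple queue.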
Re: recovering from "found xmin ... from before relfrozenxid ..."
Hi, On 2020-09-21 16:40:40 -0400, Tom Lane wrote: > Andres Freund writes: > >> I think to move forward, we need to figure out what the freezing > >> behavior ought to be for temp tables. We could make it the same > >> as it was before a7212be8b, which'd just require some more complexity > >> in vacuum_set_xid_limits. However, that negates the idea that we'd > >> like VACUUM's behavior on a temp table to be fully independent of > >> whether concurrent transactions exist. I'd prefer to allow a7212be8b's > >> behavior to stand, but then it seems we need to lobotomize the error > >> check in heap_prepare_freeze_tuple to some extent. > > > I think that's an argument for what I suggested elsewhere, which is that > > we should move the logic for a different horizon for temp tables out of > > vacuum_set_xid_limits, and into procarray. > > But procarray does not seem like a great place for > table-persistence-dependent decisions either? That ship has sailed a long long time ago though. GetOldestXmin() has looked at the passed in relation for a quite a while, and even before that we had logic about 'allDbs' etc. It doesn't easily seem possible to avoid that, given how intimately that's coupled with how snapshots are built and used, database & vacuumFlags checks etc. Greetings, Andres Freund
Re: recovering from "found xmin ... from before relfrozenxid ..."
Andres Freund writes: > On 2020-09-21 16:40:40 -0400, Tom Lane wrote: >> Andres Freund writes: >>> I think that's an argument for what I suggested elsewhere, which is that >>> we should move the logic for a different horizon for temp tables out of >>> vacuum_set_xid_limits, and into procarray. >> But procarray does not seem like a great place for >> table-persistence-dependent decisions either? > That ship has sailed a long long time ago though. GetOldestXmin() has > looked at the passed in relation for a quite a while, and even before > that we had logic about 'allDbs' etc. It doesn't easily seem possible > to avoid that, given how intimately that's coupled with how snapshots > are built and used, database & vacuumFlags checks etc. OK. Given that you've got strong feelings about this, do you want to propose a patch? I'm happy to fix it, since it's at least in part my bug, but I probably won't do it exactly like you would. regards, tom lane
Re: Handing off SLRU fsyncs to the checkpointer
On Mon, Sep 21, 2020 at 2:19 PM Thomas Munro wrote:
> While scanning for comments and identifier names that needed updating,
> I realised that this patch changed the behaviour of the ShutdownXXX()
> functions, since they currently flush the SLRUs but are not followed
> by a checkpoint. I'm not entirely sure I understand the logic of
> that, but it wasn't my intention to change it. So here's a version
> that converts the existing fsync_fname() to fsync_fname_recurse() to

Bleugh, that was probably a bad idea; it's too expensive. But it forces me to ask the question: *why* do we need to call Shutdown{CLOG,CommitTS,SUBTRANS,MultiXact}() after creating a shutdown checkpoint? I wondered if this might date from before the WAL, but I see that the pattern was introduced when the CLOG was moved out of shared buffers into a proto-SLRU in ancient commit 2589735da08, but even in that commit the preceding CreateCheckPoint() call included a call to CheckPointCLOG().
Re: new heapcheck contrib module
On Tue, Aug 25, 2020 at 10:36 AM Mark Dilger wrote: > Thanks for the review! + msg OUT text + ) Looks like atypical formatting. +REVOKE ALL ON FUNCTION +verify_heapam(regclass, boolean, boolean, cstring, bigint, bigint) +FROM PUBLIC; This too. +-- Don't want this to be available to public Add "by default, but superusers can grant access" or so? I think there should be a call to pg_class_aclcheck() here, just like the one in pg_prewarm, so that if the superuser does choose to grant access, users given access can check tables they anyway have permission to access, but not others. Maybe put that in check_relation_relkind_and_relam() and rename it. Might want to look at the pg_surgery precedent, too. Oh, and that functions header comment is also wrong. I think that the way the checks on the block range are performed could be improved. Generally, we want to avoid reporting the same problem with a variety of different message strings, because it adds burden for translators and is potentially confusing for users. You've got two message strings that are only going to be used for empty relations and a third message string that is only going to be used for non-empty relations. What stops you from just ripping off the way that this is done in pg_prewarm, which requires only 2 messages? Then you'd be adding a net total of 0 new messages instead of 3, and in my view they would be clearer than your third message, "block range is out of bounds for relation with block count %u: " INT64_FORMAT " .. " INT64_FORMAT, which doesn't say very precisely what the problem is, and also falls afoul of our usual practice of avoiding the use of INT64_FORMAT in error messages that are subject to translation. I notice that pg_prewarm just silently does nothing if the start and end blocks are swapped, rather than generating an error. We could choose to do differently here, but I'm not sure why we should bother. + all_frozen = mapbits & VISIBILITYMAP_ALL_VISIBLE; + all_visible = mapbits & VISIBILITYMAP_ALL_FROZEN; + + if ((all_frozen && skip_option == SKIP_PAGES_ALL_FROZEN) || + (all_visible && skip_option == SKIP_PAGES_ALL_VISIBLE)) + { + continue; + } This isn't horrible style, but why not just get rid of the local variables? e.g. if (skip_option == SKIP_PAGES_ALL_FROZEN) { if ((mapbits & VISIBILITYMAP_ALL_FROZEN) != 0) continue; } else { ... } Typically no braces around a block containing only one line. + * table contains corrupt all frozen bits, a concurrent vacuum might skip the all-frozen? + * relfrozenxid beyond xid.) Reporting the xid as valid under such conditions + * seems acceptable, since if we had checked it earlier in our scan it would + * have truly been valid at that time, and we break no MVCC guarantees by + * failing to notice the concurrent change in its status. I agree with the first half of this sentence, but I don't know what MVCC guarantees have to do with anything. I'd just delete the second part, or make it a lot clearer. + * Some kinds of tuple header corruption make it unsafe to check the tuple + * attributes, for example when the tuple is foreshortened and such checks + * would read beyond the end of the line pointer (and perhaps the page). In I think of foreshortening mostly as an art term, though I guess it has other meanings. Maybe it would be clearer to say something like "Some kinds of corruption make it unsafe to check the tuple attributes, for example when the line pointer refers to a range of bytes outside the page"? 
+ * Other kinds of tuple header corruption do not bare on the question of bear + pstrdup(_("updating transaction ID marked incompatibly as keys updated and locked only"))); + pstrdup(_("updating transaction ID marked incompatibly as committed and as a multitransaction ID"))); "updating transaction ID" might scare somebody who thinks that you are telling them that you changed something. That's not what it means, but it might not be totally clear. Maybe: tuple is marked as only locked, but also claims key columns were updated multixact should not be marked committed + psprintf(_("data offset differs from expected: %u vs. %u (1 attribute, has nulls)"), For these, how about: tuple data should begin at byte %u, but actually begins at byte %u (1 attribute, has nulls) etc. + psprintf(_("old-style VACUUM FULL transaction ID is in the future: %u"), + psprintf(_("old-style VACUUM FULL transaction ID precedes freeze threshold: %u"), + psprintf(_("old-style VACUUM FULL transaction ID is invalid in this relation: %u"), old-style VACUUM FULL transaction ID %u is in the future old-style VACUUM FULL transact
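For what it's worth, the pg_prewarm-style privilege check being suggested would look roughly like the fragment below; this is only a sketch, and where it lives and exactly which privilege it tests are for the patch author to decide:

    AclResult   aclresult;

    /*
     * Superusers pass ACL checks implicitly; otherwise require the same
     * privilege that would let the caller read the table anyway, so that
     * granting EXECUTE on verify_heapam() doesn't leak other relations.
     */
    aclresult = pg_class_aclcheck(RelationGetRelid(rel), GetUserId(),
                                  ACL_SELECT);
    if (aclresult != ACLCHECK_OK)
        aclcheck_error(aclresult,
                       get_relkind_objtype(rel->rd_rel->relkind),
                       RelationGetRelationName(rel));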
Re: Command statistics system (cmdstats)
On Mon, Sep 21, 2020 at 9:41 AM Alvaro Herrera wrote:
> "A couple of weeks" of inactivity is not sufficient, in my view, to boot
> a patch out of the commitfest process. Whenever the patch is
> resurrected, it will be a new entry which won't have the history that it
> had accumulated in the long time since it was created -- which biases
> it against other new patches.

As Michael said, it had been inactive for three *months*. I don't think there's a big problem here.

I think that the redesign of the CommitFest application encourages carrying things along from CF to CF forever, instead of getting rid of things that aren't wanted or aren't making any progress. That's work we don't need. There ought to be a way to mark a patch RwF when nothing's happening that lets it be revived later if and when someone gets around to resubmitting, but I think that's not possible now. Then, too, since we have email integration now, maybe we ought to do that automatically if the thread isn't getting updated and the patch is sitting there waiting on author.

It's a real waste of CFM time to chase down and kick out obviously-inactive patches, and if the CFM is going to get flack for it then that's even worse. Like, do we have 170 patches now because we have more activity than a few years ago, or just because we've become more reluctant to boot things? I don't know, but to the extent it's from unwillingness to boot things, I think that's bad. Discouraging contributors is not good, but wasting CFM and committer time is also bad. You've got to draw a line someplace or you'll go nuts.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Re: recovering from "found xmin ... from before relfrozenxid ..."
On 2020-09-21 17:03:53 -0400, Tom Lane wrote: > Andres Freund writes: > > On 2020-09-21 16:40:40 -0400, Tom Lane wrote: > >> Andres Freund writes: > >>> I think that's an argument for what I suggested elsewhere, which is that > >>> we should move the logic for a different horizon for temp tables out of > >>> vacuum_set_xid_limits, and into procarray. > > >> But procarray does not seem like a great place for > >> table-persistence-dependent decisions either? > > > That ship has sailed a long long time ago though. GetOldestXmin() has > > looked at the passed in relation for a quite a while, and even before > > that we had logic about 'allDbs' etc. It doesn't easily seem possible > > to avoid that, given how intimately that's coupled with how snapshots > > are built and used, database & vacuumFlags checks etc. > > OK. Given that you've got strong feelings about this, do you want to > propose a patch? I'm happy to fix it, since it's at least in part my > bug, but I probably won't do it exactly like you would. I can give it a try. I can see several paths of varying invasiveness, not sure yet what the best approach is. Let me think about if for a bit. Greetings, Andres Freund
Re: Compatible defaults for LEAD/LAG
Pavel Stehule writes: > I see few possibilities how to finish and close this issue: > 1. use anycompatible type and add note to documentation so expression of > optional argument can change a result type, and so this is Postgres > specific - other databases and ANSI SQL disallow this. > It is the most simple solution, and it is good enough for this case, that > is not extra important. > 2. choose the correct type somewhere inside the parser - for these two > functions. > 3. introduce and implement some generic solution for this case - it can be > implemented on C level via some function helper or on syntax level >CREATE OR REPLACE FUNCTION lag(a anyelement, offset int, default defval > a%type) ... > "defval argname%type" means for caller's expression "CAST(defval AS > typeof(argname))" I continue to feel that the spec's definition of this is not so obviously right that we should jump through hoops to duplicate it. In fact, I don't even agree that we need a disclaimer in the docs saying that it's not quite the same. Only the most nitpicky language lawyers would ever even notice. In hopes of moving this along a bit, I experimented with converting the other functions I listed to use anycompatible. I soon found that touching any functions/operators that are listed in operator classes would open a can of worms far larger than I'd previously supposed. To maintain consistency, we'd have to propagate the datatype changes to *every* function/operator associated with those opfamilies --- many of which only have one any-foo input and thus aren't on my original list. (An example here is that extending btree array_ops will require changing the max(anyarray) and min(anyarray) aggregates too.) We'd then end up with a situation that would be rather confusing from a user's standpoint, in that it'd be quite un-obvious why some array functions use anyarray while other ones use anycompatiblearray. So I backed off to just changing the functions/operators that have no opclass connections, such as array_cat. Even that has some downsides --- for example, in the attached patch, it's necessary to change some polymorphism.sql examples that explicitly reference array_cat(anyarray). I wonder whether this change would break a significant number of user-defined aggregates or operators. (Note that I think we'd have to resist the temptation to fix that by letting CREATE AGGREGATE et al accept support functions that take anyarray/anycompatiblearray (etc) interchangeably. A lot of the security analysis that went into CVE-2020-14350 depended on the assumption that utility commands only do exact lookups of support functions. If we tried to be lax about this, we'd re-introduce the possibility for hostile capture of function references in extension scripts.) Another interesting issue, not seen in the attached but which came up while I was experimenting with the more aggressive patch, was this failure in the polymorphism test: select max(histogram_bounds) from pg_stats where tablename = 'pg_am'; -ERROR: cannot compare arrays of different element types +ERROR: function max(anyarray) does not exist That's because we don't accept pg_stats.histogram_bounds (of declared type anyarray) as a valid input for anycompatiblearray. I wonder if that isn't a bug we need to fix in the anycompatible patch, independently of anything else. We'd not be able to deduce an actual element type from such an input, but we already cannot do so when we match it to an anyarray parameter. 
Anyway, attached find 0001 - updates Vik's original patch to HEAD and tweaks the grammar in the docs a bit. 0002 - add-on patch to convert array_append, array_prepend, array_cat, array_position, array_positions, array_remove, array_replace, and width_bucket to use anycompatiblearray. I think 0001 is committable, but 0002 is just WIP since I didn't touch the docs. I'm slightly discouraged about whether 0002 is worth proceeding with. Any thoughts? regards, tom lane diff --git a/doc/src/sgml/func.sgml b/doc/src/sgml/func.sgml index 461b748d89..14cb8e2ce6 100644 --- a/doc/src/sgml/func.sgml +++ b/doc/src/sgml/func.sgml @@ -19621,17 +19621,17 @@ SELECT count(*) FROM sometable; lag -lag ( value anyelement +lag ( value anycompatible , offset integer - , default anyelement ) -anyelement + , default anycompatible ) +anycompatible Returns value evaluated at the row that is offset rows before the current row within the partition; if there is no such row, instead returns default -(which must be of the same type as +(which must be of a type compatible with value). Both offset and default are evaluated @@ -19646,17 +19646,17 @@ SELECT count(*) FROM sometable; lead -lead ( value anyelement +lead ( value anycompatible
RE: Global snapshots
Hi Andrey-san, all, From: Andrey V. Lepikhov > On 7/27/20 11:22 AM, tsunakawa.ta...@fujitsu.com wrote: > > Could you take a look at this patent? I'm afraid this is the Clock-SI for > > MVCC. > Microsoft holds this until 2031. I couldn't find this with the keyword > "Clock-SI."" > > > > > > US8356007B2 - Distributed transaction management for database systems > with multiversioning - Google Patents > > https://patents.google.com/patent/US8356007 > > > > > > If it is, can we circumvent this patent? > I haven't seen this patent before. This should be carefully studied. I contacted 6 people individually, 3 holders of the patent and different 3 authors of the Clock-SI paper. I got replies from two people. (It's a regret I couldn't get a reply from the main author of Clock-SI paper.) [Reply from the patent holder Per-Ake Larson] -- Thanks for your interest in my patent. The answer to your question is: No, Clock-SI is not based on the patent - it was an entirely independent development. The two approaches are similar in the sense that there is no global clock, the commit time of a distributed transaction is the same in every partition where it modified data, and a transaction gets it snapshot timestamp from a local clock. The difference is whether a distributed transaction gets its commit timestamp before or after the prepare phase in 2PC. Hope this helpful. Best regards, Per-Ake -- [Reply from the Clock-SI author Willy Zwaenepoel] -- Thank you for your kind words about our work. I was unaware of this patent at the time I wrote the paper. The two came out more or less at the same time. I am not a lawyer, so I cannot tell you if something based on Clock-SI would infringe on the Microsoft patent. The main distinction to me seems to be that Clock-SI is based on physical clocks, while the Microsoft patent talks about logical clocks, but again I am not a lawyer. Best regards, Willy. -- Does this make sense from your viewpoint, and can we think that we can use Clock-SI without infrindging on the patent? According to the patent holder, the difference between Clock-SI and the patent seems to be fewer than the similarities. Regards Takayuki Tsunakawa
Re: Index Skip Scan (new UniqueKeys)
On Sat, Aug 15, 2020 at 7:09 AM Dmitry Dolgov <9erthali...@gmail.com> wrote: > Here is a new version that hopefully address most of the concerns > mentioned in this thread so far. As before, first two patches are taken > from UniqueKeys thread and attached only for the reference. List of > changes includes: Some thoughts on this version of the patch series (I'm focussing on v36-0005-Btree-implementation-of-skipping.patch again): * I see the following compiler warning: /code/postgresql/patch/build/../source/src/backend/optimizer/path/uniquekeys.c: In function ‘populate_baserel_uniquekeys’: /code/postgresql/patch/build/../source/src/backend/optimizer/path/uniquekeys.c:797:13: warning: ‘expr’ may be used uninitialized in this function [-Wmaybe-uninitialized] 797 | else if (!list_member(unique_index->rel->reltarget->exprs, expr)) | ^~ * Perhaps the warning is related to this nearby code that I noticed Valgrind complains about: ==1083468== VALGRINDERROR-BEGIN ==1083468== Invalid read of size 4 ==1083468==at 0x59568A: get_exprs_from_uniqueindex (uniquekeys.c:771) ==1083468==by 0x593C5B: populate_baserel_uniquekeys (uniquekeys.c:140) ==1083468==by 0x56AEA5: set_plain_rel_size (allpaths.c:586) ==1083468==by 0x56AADB: set_rel_size (allpaths.c:412) ==1083468==by 0x56A8CD: set_base_rel_sizes (allpaths.c:323) ==1083468==by 0x56A5A7: make_one_rel (allpaths.c:185) ==1083468==by 0x5AB426: query_planner (planmain.c:269) ==1083468==by 0x5AF02C: grouping_planner (planner.c:2058) ==1083468==by 0x5AD202: subquery_planner (planner.c:1015) ==1083468==by 0x5ABABF: standard_planner (planner.c:405) ==1083468==by 0x5AB7F8: planner (planner.c:275) ==1083468==by 0x6E6F84: pg_plan_query (postgres.c:875) ==1083468==by 0x6E70C4: pg_plan_queries (postgres.c:966) ==1083468==by 0x6E7497: exec_simple_query (postgres.c:1158) ==1083468==by 0x6EBCD3: PostgresMain (postgres.c:4309) ==1083468==by 0x624284: BackendRun (postmaster.c:4541) ==1083468==by 0x623995: BackendStartup (postmaster.c:4225) ==1083468==by 0x61FB70: ServerLoop (postmaster.c:1742) ==1083468==by 0x61F309: PostmasterMain (postmaster.c:1415) ==1083468==by 0x514AF2: main (main.c:209) ==1083468== Address 0x75f13e0 is 4,448 bytes inside a block of size 8,192 alloc'd ==1083468==at 0x483B7F3: malloc (in /usr/lib/x86_64-linux-gnu/valgrind/vgpreload_memcheck-amd64-linux.so) ==1083468==by 0x8C15C8: AllocSetAlloc (aset.c:919) ==1083468==by 0x8CEA52: palloc (mcxt.c:964) ==1083468==by 0x267F25: systable_beginscan (genam.c:373) ==1083468==by 0x8682CE: SearchCatCacheMiss (catcache.c:1359) ==1083468==by 0x868167: SearchCatCacheInternal (catcache.c:1299) ==1083468==by 0x867E2C: SearchCatCache1 (catcache.c:1167) ==1083468==by 0x8860B2: SearchSysCache1 (syscache.c:1123) ==1083468==by 0x8BD482: check_enable_rls (rls.c:66) ==1083468==by 0x68A113: get_row_security_policies (rowsecurity.c:134) ==1083468==by 0x683C2C: fireRIRrules (rewriteHandler.c:2045) ==1083468==by 0x687340: QueryRewrite (rewriteHandler.c:3962) ==1083468==by 0x6E6EB1: pg_rewrite_query (postgres.c:784) ==1083468==by 0x6E6D23: pg_analyze_and_rewrite (postgres.c:700) ==1083468==by 0x6E7476: exec_simple_query (postgres.c:1155) ==1083468==by 0x6EBCD3: PostgresMain (postgres.c:4309) ==1083468==by 0x624284: BackendRun (postmaster.c:4541) ==1083468==by 0x623995: BackendStartup (postmaster.c:4225) ==1083468==by 0x61FB70: ServerLoop (postmaster.c:1742) ==1083468==by 0x61F309: PostmasterMain (postmaster.c:1415) ==1083468== ==1083468== VALGRINDERROR-END (You'll see the same error if you run Postgres Valgrind + "make 
installcheck", though I don't think that the queries in question are tests that you yourself wrote.) * IndexScanDescData.xs_itup comments could stand to be updated here -- IndexScanDescData.xs_want_itup is no longer just about index-only scans. * Do we really need the AM-level boolean flag/argument named "scanstart"? Why not just follow the example of btgettuple(), which determines whether or not the scan has been initialized based on the current scan position? Just because you set so->currPos.buf to InvalidBuffer doesn't mean you cannot or should not take the same approach as btgettuple(). And even if you can't take exactly the same approach, I would still think that the scan's opaque B-Tree state should remember if it's the first call to _bt_skip() (rather than some subsequent call) in some other way (e.g. carrying a "scanstart" bool flag directly). A part of my objection to "scanstart" is that it seems to require that much of the code within _bt_skip() get another level of indentation...which makes it even more difficult to follow. * I don't understand what _bt_scankey_within_page() comments mean when they refer to "the page highkey". It looks like this function examines the highest data
RE: Transactions involving multiple postgres foreign servers, take 2
From: Masahiko Sawada > Yes, but it still seems hard to me that we require for all FDW > implementations to commit/rollback prepared transactions without the > possibility of ERROR. Of course we can't eliminate the possibility of error, because remote servers require network communication. What I'm saying is to just require the FDW to return error like xa_commit(), not throwing control away with ereport(ERROR). I don't think it's too strict. > I think it's not necessarily that all FDW implementations need to be > able to support xa_complete(). We can support both synchronous and > asynchronous executions of prepare/commit/rollback. Yes, I think parallel prepare and commit can be an option for FDW. But I don't think it's an option for a serious scale-out DBMS. If we want to use FDW as part of PostgreSQL's scale-out infrastructure, we should design (if not implemented in the first version) how the parallelism can be realized. That design is also necessary because it could affect the FDW API. > If you're concerned that executing a UDF function by like 'SELECT > myfunc();' updates data on a foreign server, since the UDF should know > which foreign server it modifies data on it should be able to register > the foreign server and mark as modified. Or you’re concerned that a > UDF function in WHERE condition is pushed down and updates data (e.g., > ‘SELECT … FROM foreign_tbl WHERE id = myfunc()’)? What I had in mind is "SELECT myfunc(...) FROM mytable WHERE col = ...;" Does the UDF call get pushed down to the foreign server in this case? If not now, could it be pushed down in the future? If it could be, it's worth considering how to detect the remote update now. Regards Takayuki Tsunakawa
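To illustrate the xa_commit()-style contract being argued for here -- purely a sketch, since no such callback exists in the FDW API today, and every name below is invented for illustration:

    /* Hypothetical result codes, in the spirit of xa_commit()'s return values */
    typedef enum FdwXactResult
    {
        FDW_XACT_OK,                /* remote prepared transaction resolved */
        FDW_XACT_ERROR_RETRYABLE,   /* e.g. network trouble: resolver retries later */
        FDW_XACT_ERROR_FATAL        /* unresolvable without operator intervention */
    } FdwXactResult;

    /*
     * Hypothetical callback: commit a transaction previously prepared on the
     * foreign server.  The point of the proposed contract is that failure is
     * reported through the return value rather than by ereport(ERROR), so the
     * caller keeps control and can decide whether to retry or hand the work
     * off to a background resolver.
     */
    typedef FdwXactResult (*CommitPreparedForeignTransaction_function)
                (ForeignServer *server, UserMapping *user, const char *prepared_id);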
Load TIME fields - proposed performance improvement
Hi Hackers. I have a test table with multiple (10) columns defined as TIME WITHOUT TIME ZONE. When loading this table with a lot of data (e.g. "COPY tbl FROM /my/path/2GB.csv WITH (FORMAT CSV)") I observed it was spending an excessive amount of time within the function GetCurrentDateTime. IIUC the code is calling GetCurrentDateTime only to acquire the current TX timestamp as a struct pg_tm in order to derive some timezone information. My test table has 10 x TIME columns. My test data has 22.5 million rows (~ 2GB) So that's 225 million times the GetCurrentDateTime function is called to populate the struct with the same values. I have attached a patch which caches this struct, so now those 225 million calls are reduced to just 1 call. ~ Test Results: Copy 22.5 million rows data (~ 2GB) BEFORE Run 1 = 4m 36s Run 2 = 4m 30s Run 3 = 4m 32s perf showed 20.95% time in GetCurrentDateTime AFTER (cached struct) Run 1 = 3m 44s Run 2 = 3m 44s Run 3 = 3m 45s perf shows no time in GetCurrentDateTime ~17% performance improvement in my environment. YMMV. ~ Thoughts? Kind Regards Peter Smith. Fujitsu Australia. DecodeTimeOnly_GetCurrentDateTime.patch Description: Binary data
Re: Load TIME fields - proposed performance improvement
Peter Smith writes: > IIUC the code is calling GetCurrentDateTime only to acquire the > current TX timestamp as a struct pg_tm in order to derive some > timezone information. > ... > I have attached a patch which caches this struct, so now those 225 > million calls are reduced to just 1 call. Interesting idea, but this implementation is leaving a *lot* on the table. If we want to cache the result of timestamp2tm applied to GetCurrentTransactionStartTimestamp(), there are half a dozen different call sites that could make use of such a cache, eg, GetSQLCurrentDate and GetSQLCurrentTime. Applying the caching only in one indirect caller of that seems pretty narrow-minded. I'm also strongly suspecting that this implementation is outright broken, since it's trying to make DecodeTimeOnly's local variable "tt" into cache storage ... but that variable could be overwritten with other values, during calls that take the other code path there. The cache ought to be inside GetCurrentDateTime or something it calls, and the value needs to be copied to the given output variable. regards, tom lane
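As a rough illustration of the shape Tom is describing (cache keyed on the transaction start timestamp, result copied into the caller's struct), a minimal sketch might look like the following; a real patch would also have to notice session timezone changes and cover the other callers such as GetSQLCurrentDate()/GetSQLCurrentTime(), and the variable names are illustrative only:

    void
    GetCurrentDateTime(struct pg_tm *tm)
    {
        static TimestampTz cache_ts = 0;    /* assumes 0 never matches a real xact start */
        static struct pg_tm cache_tm;

        TimestampTz ts = GetCurrentTransactionStartTimestamp();

        if (ts != cache_ts)
        {
            int         tz;
            fsec_t      fsec;

            /* same conversion as today, but done once per transaction */
            timestamp2tm(ts, &tz, &cache_tm, &fsec, NULL, NULL);
            cache_ts = ts;
        }

        /* copy the cached value into the caller's output struct */
        *tm = cache_tm;
    }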
Re: Load TIME fields - proposed performance improvement
I wrote: > Interesting idea, but this implementation is leaving a *lot* > on the table. If we want to cache the result of > timestamp2tm applied to GetCurrentTransactionStartTimestamp(), > there are half a dozen different call sites that could make > use of such a cache, eg, GetSQLCurrentDate and GetSQLCurrentTime. As an example, I did some quick-and-dirty "perf" measurement of this case: create table t1 (id int, d date default current_date); insert into t1 select generate_series(1,1); and found that about 10% of the runtime is spent inside timestamp2tm(). Essentially all of that cost could be removed by a suitable caching patch. Admittedly, this is a pretty well cherry-picked example, and more realistic test scenarios might see just a percent or two win. Still, for the size of the patch I'm envisioning, it'd be well worth the trouble. regards, tom lane
Re: shared-memory based stats collector
Hi, On 2020-09-08 17:55:57 +0900, Kyotaro Horiguchi wrote: > Locks on the shared statistics is acquired by the units of such like > tables, functions so the expected chance of collision are not so high. I can't really parse that... > Furthermore, until 1 second has elapsed since the last flushing to > shared stats, lock failure postpones stats flushing so that lock > contention doesn't slow down transactions. I think I commented on that before, but to me 1s seems way too low to switch to blocking lock acquisition. What's the reason for such a low limit? > > /* > - * Clean up any dead statistics collector entries for this DB. We always > + * Clean up any dead activity statistics entries for this DB. We always >* want to do this exactly once per DB-processing cycle, even if we find >* nothing worth vacuuming in the database. >*/ What is "activity statistics"? > @@ -2816,8 +2774,8 @@ table_recheck_autovac(Oid relid, HTAB *table_toast_map, > } > > /* fetch the pgstat table entry */ > - tabentry = get_pgstat_tabentry_relid(relid, classForm->relisshared, > - > shared, dbentry); > + tabentry = pgstat_fetch_stat_tabentry_snapshot(classForm->relisshared, > + >relid); Why do all of these places deal with a snapshot? For most it seems to make much more sense to just look up the entry and then copy that into local memory? There may be some place that need some sort of snapshot behaviour that's stable until commit / pgstat_clear_snapshot(). But I can't reallly see many? > +#define PGSTAT_MIN_INTERVAL 1000/* Minimum interval of > stats data > > +#define PGSTAT_MAX_INTERVAL 1 /* Longest interval of > stats data > + > * updates */ These don't really seem to be in line with the commit message... > /* > - * Structures in which backends store per-table info that's waiting to be > - * sent to the collector. > + * Enums and types to define shared statistics structure. > + * > + * Statistics entries for each object is stored in individual DSA-allocated > + * memory. Every entry is pointed from the dshash pgStatSharedHash via > + * dsa_pointer. The structure makes object-stats entries not moved by dshash > + * resizing, and allows the dshash can release lock sooner on stats > + * updates. Also it reduces interfering among write-locks on each stat entry > by > + * not relying on partition lock of dshash. PgStatLocalHashEntry is the > + * local-stats equivalent of PgStatHashEntry for shared stat entries. > + * > + * Each stat entry is enveloped by the type PgStatEnvelope, which stores > common > + * attribute of all kind of statistics and a LWLock lock object. > + * > + * Shared stats are stored as: > + * > + * dshash pgStatSharedHash > + *-> PgStatHashEntry (dshash entry) > + * (dsa_pointer)-> PgStatEnvelope (dsa memory block) I don't like 'Envelope' that much. If I understand you correctly that's a common prefix that's used for all types of stat objects, correct? If so, how about just naming it PgStatEntryBase or such? I think it'd also be useful to indicate in the "are stored as" part that PgStatEnvelope is just the common prefix for an allocation. 
> /* > - * pgStatTabHash entry: map from relation OID to PgStat_TableStatus pointer > + * entry size lookup table of shared statistics entries corresponding to > + * PgStatTypes > */ > -typedef struct TabStatHashEntry > +static size_t pgstat_entsize[] = > +/* Ditto for local statistics entries */ > +static size_t pgstat_localentsize[] = > +{ > + 0, /* > PGSTAT_TYPE_ALL: not an entry */ > + sizeof(PgStat_StatDBEntry), /* PGSTAT_TYPE_DB */ > + sizeof(PgStat_TableStatus), /* PGSTAT_TYPE_TABLE */ > + sizeof(PgStat_BackendFunctionEntry) /* PGSTAT_TYPE_FUNCTION */ > +}; These probably should be const as well. > /* > - * Backends store per-function info that's waiting to be sent to the > collector > - * in this hash table (indexed by function OID). > + * Stats numbers that are waiting for flushing out to shared stats are held > in > + * pgStatLocalHash, > */ > -static HTAB *pgStatFunctions = NULL; > +typedef struct PgStatHashEntry > +{ > + PgStatHashEntryKey key; /* hash key */ > + dsa_pointer env;/* pointer to shared stats > envelope in DSA */ > +}PgStatHashEntry; > + > +/* struct for shared statistics entry pointed from shared hash entry. */ > +typedef struct PgStatEnvelope > +{ > + PgStatTypes type; /* statistics entry type */ > + Oid databaseid; /* databaseid
Re: Logical replication from PG v13 and below to PG v14 (devel version) is not working.
On Mon, Sep 21, 2020 at 6:58 PM Amit Kapila wrote: > > On Mon, Sep 21, 2020 at 6:27 PM Ashutosh Sharma wrote: > > > > Thanks Dilip for the patch. AFAIU, the fix looks good. One small comment: > > > > Thanks Ashutosh and Dilip for working on this. I'll look into it in a > day or two. > Just a thought: Should we change the sequence of these #define in src/include/replication/logicalproto.h? #define LOGICALREP_PROTO_MIN_VERSION_NUM 1 #define LOGICALREP_PROTO_STREAM_VERSION_NUM 2 #define LOGICALREP_PROTO_VERSION_NUM 1 #define LOGICALREP_PROTO_MAX_VERSION_NUM LOGICALREP_PROTO_STREAM_VERSION_NUM I would have changed above to something like this: #define LOGICALREP_PROTO_VERSION_NUM 1 #define LOGICALREP_PROTO_STREAM_VERSION_NUM 2 #define LOGICALREP_PROTO_MIN_VERSION_NUM LOGICALREP_PROTO_VERSION_NUM #define LOGICALREP_PROTO_MAX_VERSION_NUM LOGICALREP_PROTO_STREAM_VERSION_NUM -- With Regards, Ashutosh Sharma EnterpriseDB:http://www.enterprisedb.com
Re: Load TIME fields - proposed performance improvement
Hi Tom. Thanks for your feedback. On Tue, Sep 22, 2020 at 12:44 PM Tom Lane wrote: > Still, for the size of the patch I'm envisioning, it'd be well > worth the trouble. The OP patch I gave was just a POC to test the effect and to see if the idea was judged as worthwhile... I will rewrite/fix it based on your suggestions. Kind Regards, Peter Smith. Fujitsu Australia.
Re: Index Skip Scan (new UniqueKeys)
On Mon, Sep 21, 2020 at 5:59 PM Peter Geoghegan wrote: > That's all I have for now. One more thing. I don't think that this should be a bitwise AND: if ((offnum > maxoff) & (so->currPos.nextPage == P_NONE)) { } -- Peter Geoghegan
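For clarity, the corrected form presumably intended is just the logical operator:

    /* && rather than &: a boolean test, with short-circuit evaluation */
    if (offnum > maxoff && so->currPos.nextPage == P_NONE)
    {
    }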
Re: Parallel Full Hash Join
On Tue, Sep 22, 2020 at 8:49 AM Melanie Plageman wrote: > On Wed, Sep 11, 2019 at 11:23 PM Thomas Munro wrote: >> 1. You could probably make it so that the PHJ_BATCH_SCAN_INNER phase >> in this patch (the scan for unmatched tuples) is executed by only one >> process, using the "detach-and-see-if-you-were-last" trick. Melanie >> proposed that for an equivalent problem in the looping hash join. I >> think it probably works, but it gives up a lot of parallelism and thus >> won't scale as nicely as the attached patch. > > I have attached a patch which implements this > (v1-0001-Parallel-FOJ-ROJ-single-worker-scan-buckets.patch). Hi Melanie, Thanks for working on this! I have a feeling this is going to be much easier to land than the mighty hash loop patch. And it's good to get one of our blocking design questions nailed down for both patches. I took it for a very quick spin and saw simple cases working nicely, but TPC-DS queries 51 and 97 (which contain full joins) couldn't be convinced to use it. Hmm. > For starters, in order to support parallel FOJ and ROJ, I re-enabled > setting the match bit for the tuples in the hashtable which > 3e4818e9dd5be294d97c disabled. I did so using the code suggested in [1], > reading the match bit to see if it is already set before setting it. Cool. I'm quite keen to add a "fill_inner" parameter for ExecHashJoinImpl() and have an N-dimensional lookup table of ExecHashJoin variants, so that this and much other related branching can be constant-folded out of existence by the compiler in common cases, which is why I think this is all fine, but that's for another day... > Then, workers except for the last worker detach after exhausting the > outer side of a batch, leaving one worker to proceed to HJ_FILL_INNER > and do the scan of the hash table and emit unmatched inner tuples. +1 Doing better is pretty complicated within our current execution model, and I think this is a good compromise for now. Costing for uneven distribution is tricky; depending on your plan shape, specifically whether there is something else to do afterwards to pick up the slack, it might or might not affect the total run time of the query. It seems like there's not much we can do about that. > I have also attached a variant on this patch which I am proposing to > replace it (v1-0001-Parallel-FOJ-ROJ-single-worker-scan-chunks.patch) > which has a new ExecParallelScanHashTableForUnmatched() in which the > single worker doing the unmatched scan scans one HashMemoryChunk at a > time and then frees them as it goes. I thought this might perform better > than the version which uses the buckets because 1) it should do a bit > less pointer chasing and 2) it frees each chunk of the hash table as it > scans it which (maybe) would save a bit of time during > ExecHashTableDetachBatch() when it goes through and frees the hash > table, but, my preliminary tests showed a negligible difference between > this and the version using buckets. I will do a bit more testing, > though. +1 I agree that it's the better of those two options. >> [stuff about deadlocks] > > The leader exclusion tactics and the spooling idea don't solve the > execution order deadlock possibility, so, this "all except last detach > and last does unmatched inner scan" seems like the best way to solve > both types of deadlock. Agreed (at least as long as our threads of query execution are made out of C call stacks and OS processes that block). 
>> Some other notes on the patch: >> >> Aside from the deadlock problem, there are some minor details to tidy >> up (handling of late starters probably not quite right, rescans not >> yet considered). > > These would not be an issue with only one worker doing the scan but > would have to be handled in a potential new parallel-enabled solution > like I suggested above. Makes sense. Not sure why I thought anything special was needed for rescans. >> There is a fun hard-coded parameter that controls >> the parallel step size in terms of cache lines for the unmatched scan; >> I found that 8 was a lot faster than 4, but no slower than 128 on my >> laptop, so I set it to 8. > > I didn't add this cache line optimization to my chunk scanning method. I > could do so. Do you think it is more relevant, less relevant, or the > same if only one worker is doing the unmatched inner scan? Yeah it's irrelevant for a single process, and even more irrelevant if we go with your chunk-based version. >> More thoughts along those micro-optimistic >> lines: instead of match bit in the header, you could tag the pointer >> and sometimes avoid having to follow it, and you could prefetch next >> non-matching tuple's cacheline by looking a head a bit. > > I would be happy to try doing this once we get the rest of the patch > ironed out so that seeing how much of a performance difference it makes > is more straightforward. Ignore that, I have no idea if the maintenance overhead for such an every-tuple-in-this-chain-is-ma
Parallel INSERT (INTO ... SELECT ...)
Hi Hackers, Following on from Dilip Kumar's POC patch for allowing parallelism of the SELECT part of "INSERT INTO ... SELECT ...", I have attached a POC patch for allowing parallelism of both the INSERT and SELECT parts, where it can be allowed. For cases where it can't be allowed (e.g. INSERT into a table with foreign keys, or INSERT INTO ... SELECT ... ON CONFLICT ... DO UPDATE ...") it at least allows parallelism of the SELECT part. Obviously I've had to update the planner and executor and parallel-worker code to make this happen, hopefully not breaking too many things along the way. Examples with patch applied: (1) non-parallel: test=# explain analyze insert into primary_tbl select * from third_tbl; QUERY PLAN -- Insert on primary_tbl (cost=0.00..154.99 rows= width=12) (actual time=108.445..108.446 rows=0 loops=1) -> Seq Scan on third_tbl (cost=0.00..154.99 rows= width=12) (actual time=0.009..5.282 rows= loops=1) Planning Time: 0.132 ms Execution Time: 108.596 ms (4 rows) (2) parallel: test=# explain analyze insert into primary_tbl select * from third_tbl; QUERY PLAN Gather (cost=0.00..16.00 rows= width=12) (actual time=69.870..70.310 rows=0 loops=1) Workers Planned: 5 Workers Launched: 5 -> Parallel Insert on primary_tbl (cost=0.00..16.00 rows=500 width=12) (actual time=59.948..59.949 rows=0 loops=6) -> Parallel Seq Scan on third_tbl (cost=0.00..80.00 rows=2500 width=12) (actual time=0.014..0.922 rows=1666 loops=6) Planning Time: 0.121 ms Execution Time: 70.438 ms (7 rows) (3) parallel select only (insert into table with foreign key) test=# explain analyze insert into secondary_tbl select * from third_tbl; QUERY PLAN Insert on secondary_tbl (cost=0.00..80.00 rows= width=12) (actual time=33.864..33.926 rows=0 loops=1) -> Gather (cost=0.00..80.00 rows= width=12) (actual time=0.451..5.201 rows= loops=1) Workers Planned: 4 Workers Launched: 4 -> Parallel Seq Scan on third_tbl (cost=0.00..80.00 rows=2500 width=12) (actual time=0.013..0.717 rows=2000 loops=5) Planning Time: 0.127 ms Trigger for constraint secondary_tbl_index_fkey: time=331.834 calls= Execution Time: 367.342 ms (8 rows) Known issues/TODOs: - Currently only for "INSERT INTO ... SELECT ...". To support "INSERT INTO ... VALUES ..." would need additional Table AM functions for dividing up the INSERT work amongst the workers (currently only exists for scans). - When INSERTs are made parallel, currently the reported row-count in the "INSERT 0 " status only reflects the rows that the leader has processed (not the workers) - so it is obviously less than the actual number of rows inserted. - Functions relating to computing the number of parallel workers for an INSERT, and the cost of an INSERT, need work. - "force_parallel_mode" handling was updated so that it only affects SELECT (not INSERT) - can't allow it for INSERT because we're only supporting "INSERT INTO .. SELECT ..." and don't support other types of INSERTs, and also can't allow attempted parallel UPDATEs resulting from "INSERT INTO ... SELECT ... ON CONFLICT ... DO UPDATE" etc. Thoughts and feedback? Regards, Greg Nancarrow Fujitsu Australia 0001-ParallelInsertSelect.patch Description: Binary data
Use appendStringInfoString and appendPQExpBufferStr where possible
Hi hackers, In (/src/bin/scripts/reindexdb.c; /src/backend/access/rmgrdesc/dbasedesc.c; /src/pl/plpython/plpy_elog.c) I found some more places that should use appendPQExpBufferStr (or appendStringInfoString) instead of appendPQExpBuffer (or appendStringInfo). Here are the patches. Previous Discussion: https://www.postgresql.org/message-id/CAKJS1f9P=M-3ULmPvr8iCno8yvfDViHibJjpriHU8+SXUgeZ=w...@mail.gmail.com Best regards, Houzj/huangj 0001-appendPQExpBufferStr-instead-of-appendPQExpBuffer.patch Description: 0001-appendPQExpBufferStr-instead-of-appendPQExpBuffer.patch 0001-Use-appendStringInfoString-instead-of-appendStringIn.patch Description: 0001-Use-appendStringInfoString-instead-of-appendStringIn.patch
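For context, the substitution looks like the sketch below (illustrative only, not a hunk from the attached patches; build_reindex_command is a made-up function name). appendPQExpBufferStr() appends the string directly, while appendPQExpBuffer() pushes it through printf-style formatting, so the *Str variant is the cheaper and clearer choice whenever there are no % conversions to process; the backend-side analogue is replacing appendStringInfo(buf, "text") with appendStringInfoString(buf, "text").

#include "pqexpbuffer.h"        /* PQExpBuffer, appendPQExpBuffer[Str] */

void
build_reindex_command(PQExpBuffer sql, const char *qualified_name)
{
    /* Before: appendPQExpBuffer(sql, "REINDEX TABLE "); */
    appendPQExpBufferStr(sql, "REINDEX TABLE ");

    /* appendPQExpBuffer remains the right call when formatting is actually needed. */
    appendPQExpBuffer(sql, "%s;", qualified_name);
}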
Re: Dynamic gathering the values for seq_page_cost/xxx_cost
Thanks Ashutosh for coming:) On Mon, Sep 21, 2020 at 9:03 PM Ashutosh Bapat wrote: > On Mon, Sep 21, 2020 at 9:11 AM Andy Fan wrote: > > Here are some changes for my detection program.
> >
> > | | seq_read_lat (us) | random_read_lat (us) |
> > | FIO | 12 | 148 |
> > | Previous main.c | 8.5 | 74 |
> > | invalidate_device_cache before each testing | 9 | 150 |
> > | prepare the test data file with O_DIRECT option | 15 | 150 |
> >
> > In invalidate_device_cache, I just create another 1GB data file and read > > it. (see invalidate_device_cache function) this is similar as the > previous fio setup. > > > > prepare test data file with O_DIRECT option means in the past, I prepare > the test > > file with buffer IO. and before testing, I do invalidate device cache, > file > > system cache. but the buffered prepared file still get better > performance, I > > have no idea of it. Since I don't want any cache. I use O_DIRECT > > option at last. The seq_read_lat changed from 9us to 15us. > > I still can't find out the 25% difference with the FIO result. (12 us vs > 9 us). > > > > At last, the random_page_cost happens to not change very much.
> >
> > /u/y/g/fdirect> sudo ./main
> > fs_cache_lat = 0.569031us, seq_read_lat = 18.901749us, random_page_lat = 148.650589us
> >
> > cache hit ratio: 1.00 random_page_cost 1.00
> > cache hit ratio: 0.90 random_page_cost 6.401019
> > cache hit ratio: 0.50 random_page_cost 7.663772
> > cache hit ratio: 0.10 random_page_cost 7.841498
> > cache hit ratio: 0.00 random_page_cost 7.864383
> >
> > This result looks much different from "we should use 1.1 ~ 1.5 for SSD".
>
> Very interesting. Thanks for working on this. In an earlier email you > mentioned that TPCH plans changed to efficient ones when you changed > random_page_cost = 8.6 from 4 and seq_page_cost was set to 1. IIUC, > setting random_page_cost to seq_page_cost to the same ratio as that > between the corresponding latencies improved the plans. Yes. How about > trying this with that ratio set to the one obtained from the latencies > provided by FIO? Do we see any better plans?

My tool set the random_page_cost to 8.6, but based on the fio data, it should be set to 12.3 on the same hardware, and I do see better plans as well with 12.3. Looks too smooth to believe it is true.. The attached result_fio_mytool.tar.gz is my test result. You can use git: git show HEAD^^ shows the original plan with 8.6, git show HEAD^ shows the plan changes after we changed the random_page_cost, and git show HEAD shows the run time statistics changes for these queries. I also uploaded the test tool [1] for this so you can double-check.

| | 8.6 | 12.3 |
|-----+----------+----------|
| Q2 | 2557.064 | 2444.995 |
| Q4 | 3544.606 | 3148.884 |
| Q7 | 2965.820 | 2240.185 |
| Q14 | 4988.747 | 4931.375 |

> page cost is one thing, but there are CPU costs also involved in costs > of expression evaluation. Should those be changed accordingly to the > CPU latency?

Yes, we need that as well. At the beginning of this thread, I treated all of them equally. In the first reply from Tomas, he mentioned random_page_cost specifically. After ~10 months, I tested TPCH on some hardware and found that random_page_cost was set incorrectly; after fixing it, the results look much better. So I'd like to work on this particular thing first.

[1] https://github.com/zhihuiFan/tpch-postgres/blob/master/random_page_cost.sh -- Best Regards Andy Fan result_fio_mytool.tar.gz Description: GNU Zip compressed data
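As a side note, the cache-hit-ratio numbers quoted above are consistent with weighting the measured latencies by the hit ratio and expressing the result relative to the sequential case. Whether the tool derives its output exactly this way is an assumption, but the small C sketch below reproduces the quoted figures (1.00 at a 100% hit ratio, 7.864383 at 0%):

#include <stdio.h>

/*
 * Cache-hit-aware random_page_cost estimate: the expected latency of a
 * random page fetch and of a sequential page fetch are each weighted by
 * how often the page is already in the OS cache, and the ratio is taken
 * so that seq_page_cost stays at 1.0.
 */
static double
random_page_cost_at(double hit_ratio, double fs_cache_lat,
                    double seq_read_lat, double random_read_lat)
{
    double random_cost = hit_ratio * fs_cache_lat +
                         (1.0 - hit_ratio) * random_read_lat;
    double seq_cost    = hit_ratio * fs_cache_lat +
                         (1.0 - hit_ratio) * seq_read_lat;

    return random_cost / seq_cost;
}

int
main(void)
{
    static const double hits[] = {1.00, 0.90, 0.50, 0.10, 0.00};

    for (int i = 0; i < 5; i++)
        printf("cache hit ratio: %.2f random_page_cost %f\n",
               hits[i],
               random_page_cost_at(hits[i], 0.569031, 18.901749, 148.650589));
    return 0;
}

Fed the FIO latencies instead (seq 12us, random 148us), the same formula gives about 12.3 at a 0% hit ratio, which matches the 12.3 figure discussed above.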
Re: Dynamic gathering the values for seq_page_cost/xxx_cost
> > > It's probably worth testing on various other storage systems to see >> how that applies to those. >> >> Yes, I can test more on new hardware once I get it. Now it is still in > progress. > However I can only get a physical machine with SSD or Virtual machine with > SSD, other types are hard for me right now.

Here is a result on different hardware; the test method is unchanged. [1]

Hardware Info: Virtual Machine with 61GB memory. Linux Kernel: 5.4.0-31-generic Ubuntu

# lshw -short -C disk
H/W path      Device    Class  Description
===========================================
/0/100/4/0    /dev/vda  disk   42GB Virtual I/O device
/0/100/5/0    /dev/vdb  disk   42GB Virtual I/O device

The disk on the physical machine is claimed to be an SSD. This time FIO and my tool produce almost exactly the same result.

fs_cache_lat = 0.957756us, seq_read_lat = 70.780327us, random_page_lat = 438.837257us

cache hit ratio: 1.00 random_page_cost 1.00
cache hit ratio: 0.90 random_page_cost 5.635470
cache hit ratio: 0.50 random_page_cost 6.130565
cache hit ratio: 0.10 random_page_cost 6.192183
cache hit ratio: 0.00 random_page_cost 6.199989

| | seq_read_lat(us) | random_read_lat(us) |
| FIO | 70 | 437 |
| MY Tool | 70 | 438 |

The following query plans changed when we raised random_page_cost from 4 to 6.2; the execution times changed as well.

| | random_page_cost=4 | random_page_cost=6.2 |
|-----+--------------------+----------------------|
| Q1 | 2561 | 2528.272 |
| Q10 | 4675.749 | 4684.225 |
| Q13 | 18858.048 | 18565.929 |
| Q2 | 329.279 | 308.723 |
| Q5 | 46248.132 | 7900.173 |
| Q6 | 52526.462 | 47639.503 |
| Q7 | 27348.900 | 25829.221 |

Q5 improved by 5.8 times and Q6 & Q7 improved by ~10%.

[1] https://www.postgresql.org/message-id/CAKU4AWpRv50k8E3tC3tiLWGe2DbKaoZricRh_YJ8y_zK%2BHdSjQ%40mail.gmail.com -- Best Regards Andy Fan
Re: Logical replication from PG v13 and below to PG v14 (devel version) is not working.
On Tue, Sep 22, 2020 at 8:34 AM Ashutosh Sharma wrote: > > On Mon, Sep 21, 2020 at 6:58 PM Amit Kapila wrote: > > > > On Mon, Sep 21, 2020 at 6:27 PM Ashutosh Sharma > > wrote: > > > > > > Thanks Dilip for the patch. AFAIU, the fix looks good. One small comment: > > > > > > > Thanks Ashutosh and Dilip for working on this. I'll look into it in a > > day or two. > > > > Just a thought: > > Should we change the sequence of these #define in > src/include/replication/logicalproto.h? > > #define LOGICALREP_PROTO_MIN_VERSION_NUM 1 > #define LOGICALREP_PROTO_STREAM_VERSION_NUM 2 > #define LOGICALREP_PROTO_VERSION_NUM 1 > #define LOGICALREP_PROTO_MAX_VERSION_NUM LOGICALREP_PROTO_STREAM_VERSION_NUM > > I would have changed above to something like this: > > #define LOGICALREP_PROTO_VERSION_NUM 1 > #define LOGICALREP_PROTO_STREAM_VERSION_NUM 2 > > #define LOGICALREP_PROTO_MIN_VERSION_NUM LOGICALREP_PROTO_VERSION_NUM > #define LOGICALREP_PROTO_MAX_VERSION_NUM LOGICALREP_PROTO_STREAM_VERSION_NUM > I am not sure if this suggestion makes it better than what is proposed by Dilip but I think we can declare them in define number order like below:
#define LOGICALREP_PROTO_MIN_VERSION_NUM 1
#define LOGICALREP_PROTO_VERSION_NUM 1
#define LOGICALREP_PROTO_STREAM_VERSION_NUM 2
#define LOGICALREP_PROTO_MAX_VERSION_NUM LOGICALREP_PROTO_STREAM_VERSION_NUM
Another thing is can we also test by having a publisher of v14 and subscriber of v13 or prior version, just reverse of what Ashutosh has tested? -- With Regards, Amit Kapila.
Re: Logical replication from PG v13 and below to PG v14 (devel version) is not working.
On Tue, Sep 22, 2020 at 12:02 PM Amit Kapila wrote: > > On Tue, Sep 22, 2020 at 8:34 AM Ashutosh Sharma wrote: > > > > On Mon, Sep 21, 2020 at 6:58 PM Amit Kapila wrote: > > > > > > On Mon, Sep 21, 2020 at 6:27 PM Ashutosh Sharma > > > wrote: > > > > > > > > Thanks Dilip for the patch. AFAIU, the fix looks good. One small > > > > comment: > > > > > > > > > > Thanks Ashutosh and Dilip for working on this. I'll look into it in a > > > day or two. > > > > > > > Just a thought: > > > > Should we change the sequence of these #define in > > src/include/replication/logicalproto.h? > > > > #define LOGICALREP_PROTO_MIN_VERSION_NUM 1 > > #define LOGICALREP_PROTO_STREAM_VERSION_NUM 2 > > #define LOGICALREP_PROTO_VERSION_NUM 1 > > #define LOGICALREP_PROTO_MAX_VERSION_NUM LOGICALREP_PROTO_STREAM_VERSION_NUM > > > > I would have changed above to something like this: > > > > #define LOGICALREP_PROTO_VERSION_NUM 1 > > #define LOGICALREP_PROTO_STREAM_VERSION_NUM 2 > > > > #define LOGICALREP_PROTO_MIN_VERSION_NUM LOGICALREP_PROTO_VERSION_NUM > > #define LOGICALREP_PROTO_MAX_VERSION_NUM LOGICALREP_PROTO_STREAM_VERSION_NUM > > > > I am not sure if this suggestion makes it better than what is purposed > by Dilip but I think we can declare them in define number order like > below: > #define LOGICALREP_PROTO_MIN_VERSION_NUM 1 > #define LOGICALREP_PROTO_VERSION_NUM 1 > #define LOGICALREP_PROTO_STREAM_VERSION_NUM 2 > #define LOGICALREP_PROTO_MAX_VERSION_NUM LOGICALREP_PROTO_STREAM_VERSION_NUM > The only reason I proposed that was because for the *_MAX_VERSION_NUM we are using the latest PROTOCOL version name in its definition so why not to do the same for defining *_MIN_VERSION_NUM as well. Other than that, I also wanted to correct the sequence so that they are defined in the increasing order which you have already done here. > Another thing is can we also test by having a publisher of v14 and > subscriber of v13 or prior version, just reverse of what Ashutosh has > tested? > I've tested LR from PGv12 (Publisher) to PGv14 (Subscriber) and it works fine. -- With Regards, Ashutosh Sharma EnterpriseDB:http://www.enterprisedb.com
Re: Logical replication from PG v13 and below to PG v14 (devel version) is not working.
On Tue, Sep 22, 2020 at 12:22 PM Ashutosh Sharma wrote: > > On Tue, Sep 22, 2020 at 12:02 PM Amit Kapila wrote: > > > > On Tue, Sep 22, 2020 at 8:34 AM Ashutosh Sharma > > wrote: > > > > > > On Mon, Sep 21, 2020 at 6:58 PM Amit Kapila > > > wrote: > > > > > > > > On Mon, Sep 21, 2020 at 6:27 PM Ashutosh Sharma > > > > wrote: > > > > > > > > > > Thanks Dilip for the patch. AFAIU, the fix looks good. One small > > > > > comment: > > > > > > > > > > > > > Thanks Ashutosh and Dilip for working on this. I'll look into it in a > > > > day or two. > > > > > > > > > > Just a thought: > > > > > > Should we change the sequence of these #define in > > > src/include/replication/logicalproto.h? > > > > > > #define LOGICALREP_PROTO_MIN_VERSION_NUM 1 > > > #define LOGICALREP_PROTO_STREAM_VERSION_NUM 2 > > > #define LOGICALREP_PROTO_VERSION_NUM 1 > > > #define LOGICALREP_PROTO_MAX_VERSION_NUM > > > LOGICALREP_PROTO_STREAM_VERSION_NUM > > > > > > I would have changed above to something like this: > > > > > > #define LOGICALREP_PROTO_VERSION_NUM 1 > > > #define LOGICALREP_PROTO_STREAM_VERSION_NUM 2 > > > > > > #define LOGICALREP_PROTO_MIN_VERSION_NUM LOGICALREP_PROTO_VERSION_NUM > > > #define LOGICALREP_PROTO_MAX_VERSION_NUM > > > LOGICALREP_PROTO_STREAM_VERSION_NUM > > > > > > > I am not sure if this suggestion makes it better than what is purposed > > by Dilip but I think we can declare them in define number order like > > below: > > #define LOGICALREP_PROTO_MIN_VERSION_NUM 1 > > #define LOGICALREP_PROTO_VERSION_NUM 1 > > #define LOGICALREP_PROTO_STREAM_VERSION_NUM 2 > > #define LOGICALREP_PROTO_MAX_VERSION_NUM LOGICALREP_PROTO_STREAM_VERSION_NUM > > > > The only reason I proposed that was because for the *_MAX_VERSION_NUM > we are using the latest PROTOCOL version name in its definition so why > not to do the same for defining *_MIN_VERSION_NUM as well. Other than > that, I also wanted to correct the sequence so that they are defined > in the increasing order which you have already done here. > > > Another thing is can we also test by having a publisher of v14 and > > subscriber of v13 or prior version, just reverse of what Ashutosh has > > tested? > > > > I've tested LR from PGv12 (Publisher) to PGv14 (Subscriber) and it works fine. > I meant LR from PGv14 (Publisher) to PGv12 (Subscriber) not the other way. > -- > With Regards, > Ashutosh Sharma > EnterpriseDB:http://www.enterprisedb.com