Recovery performance of standby for multiple concurrent truncates on large tables
Hello hackers,

Recently, the problem of improving the performance of multiple DROP/TRUNCATE TABLE commands in a single transaction with large shared_buffers (as shown below) was solved by commit b416691.

BEGIN;
truncate tbl001;
...
truncate tbl050;
COMMIT;

However, we have a customer that needs to execute multiple concurrent TRUNCATEs (40~50) on different large tables (as shown below) in a shorter amount of time. This case is not covered by the previous commit's improvement.

BEGIN;
truncate tbl001;
COMMIT;
...
BEGIN;
truncate tbl050;
COMMIT;

[Problem]
Currently, when the standby replays the WAL of a TRUNCATE/DROP TABLE, each record leads to a separate scan of the whole shared buffer pool to check whether the table being deleted is cached there. Moreover, if shared_buffers is large (e.g. 300GB) and the primary server fails during replay, it can take a long while for the standby to complete recovery.

[Idea]
In the current implementation, the replay of each TRUNCATE/DROP TABLE scans the whole shared buffer pool. One approach (though the idea is not really developed yet) is to improve recovery by delaying the shared buffer scan and invalidation (DropRelFileNodeBuffers) until after the next checkpoint (after failover completion). The replay of TRUNCATE/DROP TABLE would just make the checkpointer process remember which relations should be invalidated in the shared buffer pool during the subsequent checkpoint. The checkpointer then scans the shared buffer pool only once to invalidate the buffers of all relations that were dropped or truncated.

However, this is still a rough idea, so I am not sure if it's feasible. I would like to know if the community has advice or alternative solutions on how to work around this.

Any insights, advice, feedback? Thank you in advance.

Regards,
Kirk Jamison
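For reference, the per-relation cost described above comes from the shape of the invalidation scan. A simplified sketch of it follows (buffer header locking, fork numbers and the block-number cut-off are omitted, so this is not the exact source); with N truncated relations, this whole loop runs N times during recovery:

/*
 * Simplified sketch of the per-relation buffer invalidation discussed
 * above; not compilable as-is, locking and fork/block checks omitted.
 */
static void
drop_rel_buffers_sketch(RelFileNode rnode)
{
	for (int i = 0; i < NBuffers; i++)
	{
		BufferDesc *bufHdr = GetBufferDescriptor(i);

		/* unlocked pre-check; the real code re-checks under the header lock */
		if (RelFileNodeEquals(bufHdr->tag.rnode, rnode))
			InvalidateBuffer(bufHdr);
	}
}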
Re: Recovery performance of standby for multiple concurrent truncates on large tables
Hi,

On 2018-07-10 07:05:12 +, Jamison, Kirk wrote:
> Hello hackers,
>
> Recently, the problem on improving performance of multiple drop/truncate
> tables in a single transaction with large shared_buffers (as shown below) was
> solved by commit b416691.
> BEGIN;
> truncate tbl001;
> ...
> truncate tbl050;
> COMMIT;
>
> However, we have a customer that needs to execute multiple concurrent
> TRUNCATEs (40~50) on different large tables (as shown below) in a shorter
> amount of time. This one is not covered by the previous commit's improvement.
> BEGIN;
> truncate tbl001;
> COMMIT;
> ...
> BEGIN;
> truncate tbl050;
> COMMIT;
>
> [Problem]
> Currently, when the standby recovers the WAL of TRUNCATE/DROP TABLE, it leads
> to separate scans of the whole shared buffer in sequence to check whether or
> not the table to be deleted is cached in the shared buffer. Moreover, if the
> size of shared_buffers is large (i.e. 300GB) and the primary server fails
> during the replay, it would take a long while for the standby to complete
> recovery.

If you do so sequentially on the primary, I'm not clear as to why you
think the issue is bigger in recovery?

> [Idea]
> Since in the current implementation, the replay of each TRUNCATE/DROP TABLE
> scans the whole shared buffer.
> One approach (though idea is not really developed yet) is to improve the
> recovery by delaying the shared buffer scan and invalidation
> (DropRelFileNodeBuffers) and to put it after the next checkpoint (after
> failover completion). The replay of TRUNCATE/DROP TABLE just make the
> checkpointer process remember what relations should be invalidated in the
> shared buffer during subsequent checkpoint. The checkpointer then scans the
> shared buffer only once to invalidate the buffers of relations that was
> dropped and truncated.

I think you'd run into a lot of very hairy details with this approach.
Consider what happens if client processes need fresh buffers and need to
write out a victim buffer. You'll need to know that the relevant buffer is
actually invalid. Thus the knowledge about the "delayed" drops would need
to be in shared buffers and scanned on every dirty buffer writeout.

> However, this is still a rough idea, so I am not sure if it's feasible. I
> would like to know if the community has advice or other alternative solutions
> on how to work around this.
> Any insights, advice, feedback?

I personally think we should rather just work towards an ordered buffer
mapping implementation.

Greetings,

Andres Freund
Re: Non-reserved replication slots and slot advancing
Hi,

On 2018-07-10 09:54:28 +0900, Michael Paquier wrote:
> > > Also, ERRCODE_FEATURE_NOT_SUPPORTED doesn't quite seem right. How about
> > > ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE?
> >
> > +1 to both of Andres' suggestions.
>
> Those indeed sound better. What do you think of the attached?

Looks generally good. One remark:

> +	/* A slot whose restart_lsn has never been reserved cannot be advanced */
> +	if (XLogRecPtrIsInvalid(MyReplicationSlot->data.restart_lsn))
> +	{
> +		ReplicationSlotRelease();
> +		ereport(ERROR,
> +				(errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
> +				 errmsg("cannot advance replication slot that has not previously reserved WAL")));
> +	}

Why is the ReplicationSlotRelease() needed here? Shouldn't the error
handling code do so automatically?

Greetings,

Andres Freund
Re: Non-reserved replication slots and slot advancing
On Tue, Jul 10, 2018 at 12:26:30AM -0700, Andres Freund wrote: > Why is the ReplicationSlotRelease() needed here? Souldn't the error > handling code do so automatically? Oh, indeed. I didn't know that PostgresMain was doing some cleanup here. There is a second place where this can be removed, which comes from the original commit which added the advance function. An updated version is attached. Do you have other comments? -- Michael diff --git a/contrib/test_decoding/expected/slot.out b/contrib/test_decoding/expected/slot.out index 21e9d56f73..2737a8a301 100644 --- a/contrib/test_decoding/expected/slot.out +++ b/contrib/test_decoding/expected/slot.out @@ -131,3 +131,20 @@ SELECT pg_drop_replication_slot('regression_slot1'); ERROR: replication slot "regression_slot1" does not exist SELECT pg_drop_replication_slot('regression_slot2'); ERROR: replication slot "regression_slot2" does not exist +-- slot advance with physical slot, error with non-reserved slot +SELECT slot_name FROM pg_create_physical_replication_slot('regression_slot3'); +slot_name +-- + regression_slot3 +(1 row) + +SELECT pg_replication_slot_advance('regression_slot3', '0/0'); -- invalid LSN +ERROR: invalid target wal lsn +SELECT pg_replication_slot_advance('regression_slot3', '0/1'); -- error +ERROR: cannot advance replication slot that has not previously reserved WAL +SELECT pg_drop_replication_slot('regression_slot3'); + pg_drop_replication_slot +-- + +(1 row) + diff --git a/contrib/test_decoding/sql/slot.sql b/contrib/test_decoding/sql/slot.sql index 706340c1d8..24cdf7155d 100644 --- a/contrib/test_decoding/sql/slot.sql +++ b/contrib/test_decoding/sql/slot.sql @@ -68,3 +68,9 @@ SELECT 'init' FROM pg_create_logical_replication_slot('regression_slot1', 'test_ -- both should error as they should be dropped on error SELECT pg_drop_replication_slot('regression_slot1'); SELECT pg_drop_replication_slot('regression_slot2'); + +-- slot advance with physical slot, error with non-reserved slot +SELECT slot_name FROM pg_create_physical_replication_slot('regression_slot3'); +SELECT pg_replication_slot_advance('regression_slot3', '0/0'); -- invalid LSN +SELECT pg_replication_slot_advance('regression_slot3', '0/1'); -- error +SELECT pg_drop_replication_slot('regression_slot3'); diff --git a/doc/src/sgml/catalogs.sgml b/doc/src/sgml/catalogs.sgml index 3ed9021c2f..4851bc2e24 100644 --- a/doc/src/sgml/catalogs.sgml +++ b/doc/src/sgml/catalogs.sgml @@ -9867,7 +9867,8 @@ SELECT * FROM pg_locks pl LEFT JOIN pg_prepared_xacts ppx The address (LSN) of oldest WAL which still might be required by the consumer of this slot and thus won't be - automatically removed during checkpoints. + automatically removed during checkpoints. NULL + if the LSN of this slot has never been reserved. diff --git a/src/backend/replication/slotfuncs.c b/src/backend/replication/slotfuncs.c index 2806e1076c..c92f5f0789 100644 --- a/src/backend/replication/slotfuncs.c +++ b/src/backend/replication/slotfuncs.c @@ -483,6 +483,12 @@ pg_replication_slot_advance(PG_FUNCTION_ARGS) /* Acquire the slot so we "own" it */ ReplicationSlotAcquire(NameStr(*slotname), true); + /* A slot whose restart_lsn has never been reserved cannot be advanced */ + if (XLogRecPtrIsInvalid(MyReplicationSlot->data.restart_lsn)) + ereport(ERROR, +(errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE), + errmsg("cannot advance replication slot that has not previously reserved WAL"))); + /* * Check if the slot is not moving backwards. 
Physical slots rely simply * on restart_lsn as a minimum point, while logical slots have confirmed @@ -495,14 +501,11 @@ pg_replication_slot_advance(PG_FUNCTION_ARGS) minlsn = MyReplicationSlot->data.restart_lsn; if (moveto < minlsn) - { - ReplicationSlotRelease(); ereport(ERROR, (errcode(ERRCODE_FEATURE_NOT_SUPPORTED), - errmsg("cannot move slot to %X/%X, minimum is %X/%X", + errmsg("cannot advance replication slot to %X/%X, minimum is %X/%X", (uint32) (moveto >> 32), (uint32) moveto, (uint32) (minlsn >> 32), (uint32) minlsn))); - } /* Do the actual slot update, depending on the slot type */ if (OidIsValid(MyReplicationSlot->data.database))
RE: hostorder and failover_timeout for libpq
Hello Ildar,

I have a question about the failover_timeout parameter. Which would be better: having libpq handle the retry wait through this parameter, or controlling connection retries on the application side?

Also, I am not sure whether the random host selection provided by the hostorder parameter will actually have a good effect on load balancing. Please let me know if there are examples.

I am sorry if these points were already examined in the previous thread; I haven't read it yet.

Regards,
Aya Iwata
Re: New function pg_stat_statements_reset_query() to reset statistics of a specific query
On Tue, Jul 10, 2018 at 12:26 AM Fujii Masao wrote:
> On Sun, Jul 8, 2018 at 11:48 AM, Haribabu Kommi wrote:
> >
> > On Fri, Jul 6, 2018 at 3:22 AM Fujii Masao wrote:
> >>
> >> On Wed, Jul 4, 2018 at 7:12 PM, Haribabu Kommi <kommi.harib...@gmail.com>
> >> wrote:
> >> >
> >> > Update patch attached.
> >>
> >> + if (userid != 0 && dbid != 0 && queryid != 0)
> >>
> >> UINT64CONST() should be used for the constant for queryid?
> >
> > OK.
> >
> >> It's rare case, but 0 can be assigned as valid queryid. Right?
> >
> > But for normal queries, in post parse analyze function, the queryID
> > is calculated and it set to 1, in case if the calculation becomes 0.
> > But for the utility statements, the calculation is done using the
> > pgss_hash_string() function. I am not sure whether this function
> > can return 0.
>
> Though I've not read the whole code of pgss_hash_string(), ISTM that
> the function can return 0. Otherwise, the following code is unnecessary
> after queryid is assigned by hash_any_extended(),
> in pgss_post_parse_analyze().
>
> /*
>  * If we are unlucky enough to get a hash of zero, use 1 instead, to
>  * prevent confusion with the utility-statement case.
>  */
> if (query->queryId == UINT64CONST(0))
>     query->queryId = UINT64CONST(1);
>
> > If yes, then we may need same handling to utility statements
> > similar like normal statements but with queryID as 2 for utility statements.
>
> That's possible, but I think that it's better to get rid of such corner
> case at all.

QueryID 2 is now used when the generated hash value is 0 for a utility statement, to distinguish it from normal statements, and the UINT64CONST macro is used as per the earlier comment.

Updated patch attached. This patch needs to be applied on top of the ACL permissions revert patch, which is also in this thread.

Regards,
Haribabu Kommi
Fujitsu Australia

0002-pg_stat_statements_reset-to-reset-specific-query-use_v4.patch
0001-Revoke-pg_stat_statements_reset-permissions_v2.patch
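To illustrate the handling described above, the utility-statement path could mirror the existing normal-statement special case roughly like this (a sketch only, not the actual patch; the variable names are assumptions):

/*
 * Hedged sketch of the utility-statement handling discussed above, not
 * the actual patch: map a zero hash to 2 so that utility statements stay
 * distinguishable from normal statements (which map zero to 1).
 */
queryId = pgss_hash_string(querytext, query_len);	/* names assumed */
if (queryId == UINT64CONST(0))
	queryId = UINT64CONST(2);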
Re: Non-reserved replication slots and slot advancing
On 2018-07-10 16:59:07 +0900, Michael Paquier wrote:
> On Tue, Jul 10, 2018 at 12:26:30AM -0700, Andres Freund wrote:
> > Why is the ReplicationSlotRelease() needed here? Souldn't the error
> > handling code do so automatically?
>
> Oh, indeed. I didn't know that PostgresMain was doing some cleanup
> here. There is a second place where this can be removed, which comes
> from the original commit which added the advance function.

Gna, it's almost like the original code wasn't properly reviewed.

> An updated version is attached. Do you have other comments?

Looks sane, without having tested it.

> 	if (moveto < minlsn)
> -	{
> -		ReplicationSlotRelease();
> 		ereport(ERROR,
> 				(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
> -				 errmsg("cannot move slot to %X/%X, minimum is %X/%X",
> +				 errmsg("cannot advance replication slot to %X/%X, minimum is %X/%X",
> 						(uint32) (moveto >> 32), (uint32) moveto,
> 						(uint32) (minlsn >> 32), (uint32) minlsn)));
> -	}

If you're touching this, I'd also change the errcode here.

Greetings,

Andres Freund
[Tiny Debug Issue] Undefined Reference problem encountered during compile
First of all, thank you for reading this mail.

As mentioned in the title, I get an undefined reference at compile time. What puzzles me is that I did add a header file for the function.

In more detail, I want to use the TransactionIdEquals() function in the freelist.c file, so I included the header file "access/xact.h" for it. However, when running make, the compiler emits an "implicit declaration of function" warning and then an undefined reference error at link time.

How could I overcome this problem? :(
RE: How can we submit code patches that implement our (pending) patents?
From: Nico Williams [mailto:n...@cryptonector.com]
> On Sat, Jul 07, 2018 at 10:20:35AM -0700, Andres Freund wrote:
> > It's entirely possible to dual license contributions and everything. Why
> > are you making such aggressive statements about a, so far, apparently
> > good faith engagement?
>
> One problem is that many contributors would not want to be tainted by
> knowledge of the patents in question (since that leads to triple
> damages).
>
> How would you protect contributors and core developers from tainting?

IIUC, you are concerned about the possibility that PG developers would read the patent document (not the PG source code), and unconsciously use the patented algorithm in other software that's not related to PostgreSQL. That could only be avoided by not reading the patent document...

BTW, are you confident that the current PostgreSQL doesn't contain any patented code? As far as I know, the PostgreSQL development process doesn't have a step to check for patent and copyright infringement.

> One possible answer is that you wouldn't. But that might reduce the
> size of the community, or lead to a fork.

Yes, that's one unfortunate future, which of course I don't want to happen. I believe PostgreSQL should accept patented code to support its further evolution, because PostgreSQL is now popular, influential software that many organizations want to join.

Regards
Takayuki Tsunakawa
Re: [Tiny Debug Issue] Undefined Reference problem encountered during compile
Hi,

On Tue, Jul 10, 2018 at 10:19 AM, 이재훈 wrote:
>
> I want to use TransactionIdEquals() function in freelist.c file.
> Then, I included a header file, "access/xact.h" for the function.
>
> However, in make command, compiler send implicit declaration of function
> warning and then undefined reference at linking time.
>
> How could I overcome this problem? :(

This macro is actually defined in access/transam.h.
Re: [Tiny Debug Issue] Undefined Reference problem encountered during compile
On Tue, Jul 10, 2018 at 05:19:25PM +0900, 이재훈 wrote:
> I want to use TransactionIdEquals() function in freelist.c file.
> Then, I included a header file, "access/xact.h" for the function.

This is in access/transam.h.
--
Michael
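A minimal usage sketch (the transaction id variables below are made-up placeholders, not taken from the original question):

#include "access/transam.h"		/* provides the TransactionIdEquals() macro */

/* hypothetical snippet: xid1 and xid2 are placeholder TransactionId values */
if (TransactionIdEquals(xid1, xid2))
	elog(DEBUG1, "same transaction");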
Re: Non-reserved replication slots and slot advancing
On Tue, Jul 10, 2018 at 01:16:30AM -0700, Andres Freund wrote:
> On 2018-07-10 16:59:07 +0900, Michael Paquier wrote:
> > 	if (moveto < minlsn)
> > -	{
> > -		ReplicationSlotRelease();
> > 		ereport(ERROR,
> > 				(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
> > -				 errmsg("cannot move slot to %X/%X, minimum is %X/%X",
> > +				 errmsg("cannot advance replication slot to %X/%X, minimum is %X/%X",
> > 						(uint32) (moveto >> 32), (uint32) moveto,
> > 						(uint32) (minlsn >> 32), (uint32) minlsn)));
> > -	}
>
> If you're touching this, I'd also change the errcode here.

Yep, let's change that as well. If you want to look at that stuff more deeply, please feel free. Otherwise I could always push what I have now.
--
Michael
RE: How can we submit code patches that implement our (pending) patents?
From: Markus Wanner [mailto:markus.wan...@2ndquadrant.com]
> equally sure there are well intended ones as well. For example, I'd
> expect patent pools (including the Open Invention Network, cited by the
> OP) to hire non-IANAL personnel who know Legalese well enough to setup
> valid contracts (between participating companies).

I think I'll consult Open Invention Network on this issue, since I haven't received any reply from SFLC.

> I certainly like the (future) patent holder coming forth to offer a
> grant a lot better than the one who doesn't (but still holds the
> patent). I'm missing the appreciation for that former strategy in this
> thread and fear we're setting a precedent for the latter one, instead.

Me too.

Regards
Takayuki Tsunakawa
Re: How can we submit code patches that implement our (pending) patents?
Hi

On Tue, Jul 10, 2018 at 9:29 AM, Tsunakawa, Takayuki <tsunakawa.ta...@jp.fujitsu.com> wrote:
> From: Markus Wanner [mailto:markus.wan...@2ndquadrant.com]
> > equally sure there are well intended ones as well. For example, I'd
> > expect patent pools (including the Open Invention Network, cited by the
> > OP) to hire non-IANAL personnel who know Legalese well enough to setup
> > valid contracts (between participating companies).
>
> I think I'll consult Open Invention Network on this issue, since I haven't
> received any reply from SFLC.

SFLC have acted as the project's counsel in the past, so I'm not surprised they aren't talking to you; you won't be a known contact to them as a PG contributor, and as a Fujitsu employee there would likely be a conflict of interest for them to talk to you.

> > I certainly like the (future) patent holder coming forth to offer a
> > grant a lot better than the one who doesn't (but still holds the
> > patent). I'm missing the appreciation for that former strategy in this
> > thread and fear we're setting a precedent for the latter one, instead.
>
> Me too.
>
> Regards
> Takayuki Tsunakawa

--
Dave Page
Blog: http://pgsnake.blogspot.com
Twitter: @pgsnake

EnterpriseDB UK: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Re: Recovery performance of standby for multiple concurrent truncates on large tables
On Tue, Jul 10, 2018 at 10:05 AM Jamison, Kirk wrote:
> Since in the current implementation, the replay of each TRUNCATE/DROP
> TABLE scans the whole shared buffer.
>
> One approach (though idea is not really developed yet) is to improve the
> recovery by delaying the shared buffer scan and invalidation
> (DropRelFileNodeBuffers) and to put it after the next checkpoint (after
> failover completion). The replay of TRUNCATE/DROP TABLE just make the
> checkpointer process remember what relations should be invalidated in the
> shared buffer during subsequent checkpoint. The checkpointer then scans the
> shared buffer only once to invalidate the buffers of relations that was
> dropped and truncated.

How about using the background writer for this? It seems to me that the main reason to invalidate buffers would be to free them up for buffer allocation, which is precisely the task of the background writer. When adding a filenode to be invalidated, take note of the bgwriter position and add the filenode to a queue. As the bgwriter advances, it checks each buffer tag against a hash table of filenodes being invalidated. When the background writer has completed a loop, it can remove the invalidated filenode. When the bgwriter falls behind the clock sweep and there are filenodes to invalidate, it should run the invalidation scan instead of skipping ahead. If there are already too many filenodes being invalidated, then whoever is trying to add a new one gets to run the invalidation scan until something can be evicted.

--
Ants Aasma
Cybertec Schönig & Schönig GmbH
Gröhrmühlgasse 26, A-2700 Wiener Neustadt
Web: https://www.cybertec-postgresql.com/
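For illustration only, the buffer-tag check described above might look roughly like this; every identifier here (the hash table, struct layout and function name) is invented for the sketch and nothing below exists in the tree:

/*
 * Illustrative sketch of the idea above; all names are invented.
 * A small shared hash table keyed by RelFileNode holds the filenodes
 * whose buffers still need to be invalidated, and the clock sweep
 * consults it for every buffer it passes.
 */
static HTAB *pending_invalidations;		/* hypothetical shared hash table */

static bool
buffer_has_pending_invalidation(BufferDesc *bufHdr)
{
	return hash_search(pending_invalidations,
					   &bufHdr->tag.rnode,
					   HASH_FIND, NULL) != NULL;
}

/*
 * In the bgwriter loop (sketch): if buffer_has_pending_invalidation()
 * returns true, invalidate the buffer instead of considering it for
 * cleaning; once the sweep has made a full circle past the position noted
 * when a filenode was queued, that filenode can be removed from the table.
 */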
Re: [PATCH] Improve geometric types
> The version number restriction isn't strictly needed. I only
> suggested it because it'd match the #if that wraps the code that's
> actually using those macros, introduced by commit cec8394b5ccd. That
> was presumably done because versions >= 1800 (= Visual Studio 2013)
> have their own definitions of isinf() and isnan(), and I guess that
> our definitions were probably breaking stuff on that compiler.

Now I understand what you mean. win32_port.h defines isnan(x) as _isnan(x) if (_MSC_VER < 1800). It doesn't look right to have the definition in there without including the header that _isnan() is coming from. I am preparing an additional patch to add the include and remove it from the files where it was obviously put only to work around this problem.
Re: Changing WAL Header to reduce contention during ReserveXLogInsertLocation()
On 09.07.18 15:49, Heikki Linnakangas wrote: > The way > forward is to test if we can get the same performance benefit from > switching to CMPXCHG16B, and keep the WAL format unchanged. I have implemented this. I didn't see any performance benefit using the benchmark that Alexander Kuzmenkov outlined earlier in the thread. (But I also didn't see a great benefit for the originally proposed patch, so maybe I'm not doing this right or this particular hardware doesn't benefit from it as much.) I'm attaching the patches and scripts here if someone else wants to do more testing. The first patch adds a zoo of 128-bit atomics support. It's consistent with (a.k.a. copy-and-pasted from) the existing 32- and 64-bit set, but it's not the complete set, only as much as was necessary for this exercise. The second patch then makes use of that in the WAL code under discussion. pgbench invocations were: pgbench -i -I t bench pgbench -n -c $N -j $N -M prepared -f pgbench-wal-cas.sql -T 60 bench for N from 1 to 32. Note: With gcc (at least versions 7 and 8) you need to use some non-default -march setting to get 128-bit atomics to work. (Otherwise the configure test fails and the fallback implementation is used.) I have found the minimum to be -march=nocona. But different -march settings actually affect the benchmark performance, so be sure to test the baseline with the same -march setting. Recommended configure invocation: ./configure ... CC='gcc -march=whatever' clang appears to work out of the box. Also, curiously my gcc installations provide 128-bit __sync_val_compare_and_swap() but not 128-bit __atomic_compare_exchange_n() in spite of what the documentation indicates. So independent of whether this approach actually provides any benefit, the 128-bit atomics support seems a bit wobbly. (I'm also wondering why we are using __sync_val_compare_and_swap() rather than __sync_bool_compare_and_swap(), since all we're doing with the return value is emulate the bool version anyway.) -- Peter Eisentraut http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services From 3359d5d5dd828bc0317ca28474ebe5dd931d44e8 Mon Sep 17 00:00:00 2001 From: Peter Eisentraut Date: Wed, 4 Jul 2018 22:13:58 +0200 Subject: [PATCH 1/2] Add some int128 atomics support --- config/c-compiler.m4 | 31 + configure | 65 ++ configure.in | 2 + src/backend/port/atomics.c | 51 +++ src/include/pg_config.h.in | 8 +++ src/include/port/atomics.h | 49 ++ src/include/port/atomics/fallback.h| 30 + src/include/port/atomics/generic-gcc.h | 46 + src/include/port/atomics/generic.h | 91 ++ 9 files changed, 373 insertions(+) diff --git a/config/c-compiler.m4 b/config/c-compiler.m4 index 9731a517de..a4aecd3e61 100644 --- a/config/c-compiler.m4 +++ b/config/c-compiler.m4 @@ -606,6 +606,21 @@ if test x"$pgac_cv_gcc_sync_int64_cas" = x"yes"; then AC_DEFINE(HAVE_GCC__SYNC_INT64_CAS, 1, [Define to 1 if you have __sync_val_compare_and_swap(int64 *, int64, int64).]) fi])# PGAC_HAVE_GCC__SYNC_INT64_CAS +# PGAC_HAVE_GCC__SYNC_INT128_CAS +# - +# Check if the C compiler understands __sync_compare_and_swap() for 128bit +# types, and define HAVE_GCC__SYNC_INT128_CAS if so. 
+AC_DEFUN([PGAC_HAVE_GCC__SYNC_INT128_CAS], +[AC_CACHE_CHECK(for builtin __sync int128 atomic operations, pgac_cv_gcc_sync_int128_cas, +[AC_LINK_IFELSE([AC_LANG_PROGRAM([], + [PG_INT128_TYPE lock = 0; + __sync_val_compare_and_swap(&lock, 0, (PG_INT128_TYPE) 37);])], + [pgac_cv_gcc_sync_int128_cas="yes"], + [pgac_cv_gcc_sync_int128_cas="no"])]) +if test x"$pgac_cv_gcc_sync_int128_cas" = x"yes"; then + AC_DEFINE(HAVE_GCC__SYNC_INT128_CAS, 1, [Define to 1 if you have __sync_val_compare_and_swap(int128 *, int128, int128).]) +fi])# PGAC_HAVE_GCC__SYNC_INT128_CAS + # PGAC_HAVE_GCC__ATOMIC_INT32_CAS # - # Check if the C compiler understands __atomic_compare_exchange_n() for 32bit @@ -638,6 +653,22 @@ if test x"$pgac_cv_gcc_atomic_int64_cas" = x"yes"; then AC_DEFINE(HAVE_GCC__ATOMIC_INT64_CAS, 1, [Define to 1 if you have __atomic_compare_exchange_n(int64 *, int64 *, int64).]) fi])# PGAC_HAVE_GCC__ATOMIC_INT64_CAS +# PGAC_HAVE_GCC__ATOMIC_INT128_CAS +# - +# Check if the C compiler understands __atomic_compare_exchange_n() for 128bit +# types, and define HAVE_GCC__ATOMIC_INT128_CAS if so. +AC_DEFUN([PGAC_HAVE_GCC__ATOMIC_INT128_CAS], +[AC_CACHE_CHECK(for builtin __atomic int128 atomic operations, pgac_cv_gcc_atomic_int128_cas, +[AC_LINK_IFELSE([AC_LANG_PROGRAM([], + [PG_INT128_TYPE val = 0; + PG_INT128_TYPE expect = 0; + __atomic_compare_exchange_n(&val, &expect, 37, 0, __ATOMIC_SEQ_CST, __ATOMIC_RELAXED);])], + [pgac_cv_gcc_atomic_int128_
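As an aside on the __sync_val_compare_and_swap() versus __sync_bool_compare_and_swap() remark above, a standalone sketch of the emulation in question (not taken from the patch; requires gcc/clang with 16-byte __sync builtins enabled, e.g. -mcx16 or -march=nocona as noted earlier) would be:

#include <stdbool.h>

/*
 * Sketch only: emulating the "bool" flavour of compare-and-swap on top of
 * the "val" flavour, which is all the return value is currently used for.
 */
static inline bool
cas128_bool(volatile __int128 *ptr, __int128 *expected, __int128 newval)
{
	__int128	old = __sync_val_compare_and_swap(ptr, *expected, newval);

	if (old == *expected)
		return true;			/* the swap happened */
	*expected = old;			/* report the value actually observed */
	return false;
}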
Re: [PATCH] Timestamp for a XLOG_BACKUP_END WAL-record
On 10.07.2018 06:45, Andres Freund wrote:
> Hi,
>
> On 2018-07-10 06:41:32 +0500, Andrey V. Lepikhov wrote:
> > This functionality is needed in practice when we have to determine a
> > recovery time of specific backup.
>
> What do you mean by "recovery time of specific backup"?

Recovery time here is the point in time at which the backup of the PostgreSQL database instance was made. When performing database recovery, we want to know what point in time the restored database will correspond to. This functionality is about improving the usability of the pg_basebackup and pg_probackup utilities.

> > This code developed in compatibility with WAL segments, which do not have
> > a timestamp in a XLOG_BACKUP_END record.
>
> I don't understand what "compatibility with WAL segments" could mean? And
> how are WAL segments related to "XLOG_BACKUP_END record", except as to how
> every WAL record is related? Are you thinking about the switch records?

In this case 'compatibility' means that the patched postgres code (pg_basebackup, pg_probackup, pg_waldump etc.) will still correctly read WAL segments whose XLOG_BACKUP_END records do not contain a timestamp field.

> Greetings,
> Andres Freund

--
Andrey Lepikhov
Postgres Professional: https://postgrespro.com
The Russian Postgres Company
Costing bug in hash join logic for semi joins
There is a costing bug in the hash join logic that seems to have been introduced by the patch related to the inner_unique enhancements (commit 9c7f5229ad68d7e0e4dd149e3f80257893e404d4). Specifically, "hashjointuples", which tracks the number of matches for the hash clauses, is computed incorrectly in the inner_unique scenario. This affects a lot of semi-joins that are turned into inner joins with a unique inner side.

The function final_cost_hashjoin() has special handling to cost semi/anti joins to account for stopping early after the first match. The commit above extended that handling to the inner_unique scenario as well. However, the "hashjointuples" computation was not adjusted to handle this new scenario, so it incorrectly steps into the anti-join logic and counts unmatched rows. The fix is to handle the inner_unique case when computing "hashjointuples". Here is an outline of the code that shows the bug:

void
final_cost_hashjoin(PlannerInfo *root, HashPath *path,
                    JoinCostWorkspace *workspace,
                    JoinPathExtraData *extra)
{
    ...
    /* CPU costs */
    if (path->jpath.jointype == JOIN_SEMI ||
        path->jpath.jointype == JOIN_ANTI ||
        extra->inner_unique)
    {
        ...
        /* Get # of tuples that will pass the basic join */
        if (path->jpath.jointype == JOIN_SEMI)
            hashjointuples = outer_matched_rows;
        else
            hashjointuples = outer_path_rows - outer_matched_rows;
    }
    else
    {
        ...
    }
}

Thanks,
RK (Salesforce)
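A sketch of the kind of fix described above (illustrative only, not necessarily the exact committed change) would make only the anti-join case count unmatched rows:

        /*
         * Sketch of the suggested fix: SEMI joins and inner_unique joins stop
         * after the first match, so they emit the matched outer rows; only
         * ANTI joins should count the unmatched outer rows.
         */
        if (path->jpath.jointype == JOIN_ANTI)
            hashjointuples = outer_path_rows - outer_matched_rows;
        else
            hashjointuples = outer_matched_rows;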
user-friendliness improvement of pageinspect
Hi,

Why does heap_page_item() of the pageinspect extension not provide better user-friendliness? My test table has the following data, and when I look at the attributes (t_attrs) I see bytea values instead of a more intuitive representation, ideally using the same types as the original table.

# select * from test1;
 a |  b
---+------
 1 | text
(1 row)

# SELECT lp as tuple, t_xmin, t_xmax, t_attrs[1] as a, t_attrs[2] as b
  FROM heap_page_item_attrs(get_raw_page('test1', 0), 'test1');
 tuple | t_xmin | t_xmax |   a    |      b
-------+--------+--------+--------+--------------
     1 |    587 |      0 | \x0100 | \x0b74657874
(1 row)

I would prefer something like this instead:

# SELECT lp as tuple, t_xmin, t_xmax, t_attrs[1] as a, t_attrs[2] as b
  FROM heap_page_item_attrs(get_raw_page('test1', 0), 'test1');
 tuple | t_xmin | t_xmax | a |  b
-------+--------+--------+---+------
     1 |    587 |      0 | 1 | text
(1 row)

Looking forward to a solution.
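One possible direction (just a sketch of the general approach, not an actual proposal or patch) would be for the function to run each decoded attribute through its type's output function instead of returning the raw bytes; "atttypid" and "attvalue" below are made-up placeholders for whatever the function has already decoded:

	/*
	 * Sketch only: render a decoded attribute with its type's output
	 * function instead of returning a raw bytea.  "attvalue" is assumed to
	 * be a Datum of type "atttypid".
	 */
	Oid			typoutput;
	bool		typisvarlena;
	char	   *textual;

	getTypeOutputInfo(atttypid, &typoutput, &typisvarlena);
	textual = OidOutputFunctionCall(typoutput, attvalue);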
Re: [HACKERS] PoC plpgsql - possibility to force custom or generic plan
On 23.01.18 17:08, Pavel Stehule wrote:
> attached updated patch

This appears to be the patch of record in this thread. I think there is general desire for adding a setting like this, and the implementation is simple enough.

One change perhaps: how about naming the default setting "auto" instead of "default"? That makes it clearer what it does.

--
Peter Eisentraut  http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Re: Costing bug in hash join logic for semi joins
On 10 July 2018 at 11:44, RK wrote:
> There is a costing bug in hash join logic seems to have been introduced by
> the patch related to inner_unique enhancements(commit:
> 9c7f5229ad68d7e0e4dd149e3f80257893e404d4). Specifically, "hashjointuples"
> which tracks the number of matches for hash clauses is computed wrong for
> inner unique scenario. This leads to lot of semi-joins incorrectly turn to
> inner joins with unique on inner side. Function "final_cost_hashjoin" has
> special handling to cost semi/anti joins to account for early stop after the
> first match. This is enhanced by the above said commit to be used for
> inner_unique scenario also. However, "hashjointuples" computation is not
> fixed to handle this new scenario which is incorrectly stepping into the
> anti join logic and assuming unmatched rows. Fix is to handle inner_unique
> case when computing "hashjointuples". Here is the outline of the code that
> shows the bug.

Thanks for the analysis and the report. I agree the code is wrong. Looks simple enough just to handle the semi and unique joins in the else clause and make the if handle the ANTI join. I've done that in the attached.

Also on reading the comment above, it looks slightly incorrect. To me, it looks like it's applying a twentieth of the cost and not a tenth as the comment claims. I couldn't resist updating that too.

I didn't feel the need to add any regression tests around this. It seems like an unlikely bug to ever appear again.

--
David Rowley  http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

fix_unique_hash_join_costing.patch
Re: Expression errors with "FOR UPDATE" and postgres_fdw with partition wise join enabled.
On Tue, Jul 10, 2018 at 9:08 AM, Etsuro Fujita wrote:
> (2018/07/09 20:43), Ashutosh Bapat wrote:
>>> I don't have any numbers right now, so that is nothing but a concern. But as
>>> I said in a previous email, in the approach I proposed, we don't need to
>>> spend extra cycles where partitioning is not involved. I think that is a
>>> good thing in itself. No?
>>
>> At the cost of having targetlist being type inconsistent. I don't have
>> any testcase either to show that that's a problem in practice. So,
>> it's a trade-off of a concern vs concern.
>
> As I said before, I don't see any issue in having such a targetlist, but I
> might be missing something, so I'd like to discuss a bit more about that.
> Could you tell me the logic/place in the PG code where you think the problem
> might occur. (IIRC, you mentioned something about that before (pathkeys? or
> index-only scans?), but sorry, I don't understand that.)

IIUC, an index-only scan will be used when all the required columns are covered by an index. If there is an index on a whole-row reference of the parent, it will be translated into a ConvertRowtypeExpr of the child when the corresponding index on the child is created. If the targetlist doesn't have the ConvertRowtypeExpr, as with your patch, the planner won't be able to use such an index on the child table. But I couldn't create an index with a whole-row reference in it, so I think this isn't possible right now; it could become a problem in the future if we allow creating an index on a whole-row reference.

A pathkey points to an equivalence class, which contains equivalence members. A parent equivalence class member containing a whole-row reference gets translated into a child equivalence member containing a ConvertRowtypeExpr. At places in the planner we match equivalence members to targetlist entries. This matching will fail unexpectedly when the ConvertRowtypeExpr is removed from a child's targetlist. But again, I couldn't reproduce a problem where such a mismatch arises.

--
Best Wishes,
Ashutosh Bapat
EnterpriseDB Corporation
The Postgres Database Company
RE: Locking B-tree leafs immediately in exclusive mode
On Mon, July 9, 2018 at 4:44 PM, Alexander Korotkov wrote:
> > Firstly, I did performance tests on 72-cores machines(AWS c5.18xlarge) same
> > as you did.
>
> OK. But not that c5.18xlarge is 72-VCPU machine, which AFAIK is
> close to the performance of 36 physical cores.

Thanks for pointing that out. I referred to /proc/cpuinfo and understood there are 36 physical cores.

> In this case it also looks like we observed 1% regression. Despite 1%
> may seem to be very small, I think we should clarify whether it really
> exists. I have at least two hypothesis about this.
>
> 1) There is no real regression, observed difference of TPS is less
> than error of measurements. In order to check that we need to retry
> the experiment multiple times. Also, if you run benchmark on master
> before patched version (or vice versa) you should also try to swap the
> order to make sure there is no influence of the order of benchmarks.
> 2) If we consider relation between TPS and number of clients, TPS is
> typically growing with increasing number of clients until reach some
> saturation value. After the saturation value, there is some
> degradation of TPS. If patch makes some latency lower, that my cause
> saturation to happen earlier. In order to check that, we need run
> benchmarks with various number of clients and draw a graph: TPS
> depending on clients.
>
> So, may I ask you to make more experiments in order to clarify the
> observed regression?

I experimented with 2), changing the clients parameter across 18, 36, 54 and 72.

While doing the experiment, I realized that the results of pgbench with 36 clients improve after executing pgbench with 72 clients. I don't know why this occurs, but anyway, in this experiment, I executed pgbench with 72 clients before the other pgbench runs (e.g. -c 72, -c 18, -c 36, -c 54, -c 72).

I ran the experiments against master and patched in varying order (e.g. master, patched, patched, master, master, patched, patched, master).

# results of changing clients (18, 36, 54, 72 clients)
master,  -c 18 -j 18: Ave 400410 TPS (407615,393942,401845,398241)
master,  -c 36 -j 36: Ave 415616 TPS (411939,400742,424855,424926)
master,  -c 54 -j 54: Ave 378734 TPS (401646,354084,408044,351163)
master,  -c 72 -j 72: Ave 360864 TPS (367718,360029,366432,349277)
patched, -c 18 -j 18: Ave 393115 TPS (382854,396396,395530,397678)
patched, -c 36 -j 36: Ave 390328 TPS (376100,404873,383498,396840)
patched, -c 54 -j 54: Ave 364894 TPS (365533,373064,354250,366727)
patched, -c 72 -j 72: Ave 353982 TPS (355237,357601,345536,357553)

It seems the saturation point is between 18 and 36 clients, so I also experimented with 27 clients.

# results of changing clients (27 clients)
master,  -c 27 -j 27: Ave 416756 TPS (423512,424241,399241,420030)
patched, -c 27 -j 27: Ave 413568 TPS (410187,404291,420152,419640)

I created a graph and attached it to this mail ("detecting saturation.png"). Referring to the graph, the patched version's saturation happens earlier than master's, as you expected. But even the patched version's nearly saturated TPS value shows a small regression from master's, I think.

Are there other experiments to do about this?

Yoshikazu Imai
Re: Locking B-tree leafs immediately in exclusive mode
On Mon, Jul 9, 2018 at 8:18 PM Peter Geoghegan wrote:
> On Mon, Jul 9, 2018 at 9:43 AM, Alexander Korotkov wrote:
> > In this case it also looks like we observed 1% regression. Despite 1%
> > may seem to be very small, I think we should clarify whether it really
> > exists. I have at least two hypothesis about this.
> >
> > 1) There is no real regression, observed difference of TPS is less
> > than error of measurements. In order to check that we need to retry
> > the experiment multiple times. Also, if you run benchmark on master
> > before patched version (or vice versa) you should also try to swap the
> > order to make sure there is no influence of the order of benchmarks.
> > 2) If we consider relation between TPS and number of clients, TPS is
> > typically growing with increasing number of clients until reach some
> > saturation value. After the saturation value, there is some
> > degradation of TPS. If patch makes some latency lower, that my cause
> > saturation to happen earlier. In order to check that, we need run
> > benchmarks with various number of clients and draw a graph: TPS
> > depending on clients.
> >
> > So, may I ask you to make more experiments in order to clarify the
> > observed regression?
>
> It would be nice to actually see script_duplicated.sql. I don't know
> exactly what the test case was.

BTW, the contents of script_duplicated.sql were posted by Imai, Yoshikazu in the message body:

> # script_duplicated.sql
> INSERT INTO unordered VALUES (1, 'abcdefghijklmnoprsqtuvwxyz');

--
Alexander Korotkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company
Re: Locking B-tree leafs immediately in exclusive mode
On Tue, Jul 10, 2018 at 2:19 PM Imai, Yoshikazu wrote: > On Mon, July 9, 2018 at 4:44 PM, Alexander Korotkov wrote: > > > Firstly, I did performance tests on 72-cores machines(AWS c5.18xlarge) > > > same as you did. > > > > OK. But not that c5.18xlarge is 72-VCPU machine, which AFAIK is > > close to the performance of 36 physical cores. > > Thanks for pointing that. I referred to /proc/cpuinfo and understood there > are 36 physical cores. > > > In this case it also looks like we observed 1% regression. Despite 1% > > may seem to be very small, I think we should clarify whether it really > > exists. I have at least two hypothesis about this. > > > > 1) There is no real regression, observed difference of TPS is less > > than error of measurements. In order to check that we need to retry > > the experiment multiple times. Also, if you run benchmark on master > > before patched version (or vice versa) you should also try to swap the > > order to make sure there is no influence of the order of benchmarks. > > 2) If we consider relation between TPS and number of clients, TPS is > > typically growing with increasing number of clients until reach some > > saturation value. After the saturation value, there is some > > degradation of TPS. If patch makes some latency lower, that my cause > > saturation to happen earlier. In order to check that, we need run > > benchmarks with various number of clients and draw a graph: TPS > > depending on clients. > > > > So, may I ask you to make more experiments in order to clarify the > > observed regression? > > I experimented 2) with changing clients parameter with 18, 36, 54, 72. > While doing experiment, I realized that results of pgbench with 36 clients > improve after executing pgbench with 72 clients. > I don't know why this occurs, but anyway, in this experiment, I executed > pgbench with 72 clients before executing other pgbenchs. (e.g. -c 72, -c 18, > -c 36, -c 54, -c 72) > I tested experiments to master and patched unorderly(e.g. master, patched, > patched, master, master, patched, patched, master) > > # results of changing clients(18, 36, 54, 72 clients) > master, -c 18 -j 18: Ave 400410 TPS (407615,393942,401845,398241) > master, -c 36 -j 36: Ave 415616 TPS (411939,400742,424855,424926) > master, -c 54 -j 54: Ave 378734 TPS (401646,354084,408044,351163) > master, -c 72 -j 72: Ave 360864 TPS (367718,360029,366432,349277) > patched, -c 18 -j 18: Ave 393115 TPS (382854,396396,395530,397678) > patched, -c 36 -j 36: Ave 390328 TPS (376100,404873,383498,396840) > patched, -c 54 -j 54: Ave 364894 TPS (365533,373064,354250,366727) > patched, -c 72 -j 72: Ave 353982 TPS (355237,357601,345536,357553) > > It may seem saturation is between 18 and 36 clients, so I also experimented > with 27 clients. > > # results of changing clients(27 clients) > master, -c 27 -j 27: Ave 416756 (423512,424241,399241,420030) TPS > patched, -c 27 -j 27: Ave 413568 (410187,404291,420152,419640) TPS > > I created a graph and attached in this mail("detecting saturation.png"). > Referring to a graph, patched version's saturation happens earlier than > master's one as you expected. > But even the patched version's nearly saturated TPS value has small > regression from the master's one, I think. > > Is there another experiments to do about this? Thank you for the experiments! It seems that there is real regression here... BTW, which script were you using in this benchmark: script_unordered.sql or script_duplicated.sql? 
-- Alexander Korotkov Postgres Professional: http://www.postgrespro.com The Russian Postgres Company
Re: Excessive PostmasterIsAlive calls slow down WAL redo
On 27/06/18 08:26, Thomas Munro wrote: On Wed, Apr 25, 2018 at 6:23 PM, Thomas Munro wrote: On Tue, Apr 24, 2018 at 7:37 PM, Michael Paquier wrote: Thomas, trying to understand here... Why this place for the signal initialization? Wouldn't InitPostmasterChild() be a more logical place as we'd want to have this logic caught by all other processes? Yeah, you're right. Here's one like that. Amazingly, due to the way the project schedules fell and thanks to the help of a couple of very supportive FreeBSD committers and reviewers, the new PROC_PDEATHSIG_CTL kernel facility[1] landed in a production release today, beating the Commitfest by several days. My question is whether this patch set is enough, or people (Andres?) want to go further and actually kill the postmaster death pipe completely. My kqueue patch would almost completely kill it on BSDs as it happens, but for Linux there are a number of races and complications to plug to do that IIUC. I don't immediately see what there is to gain by doing that work (assuming we reuse WaitEventSet objects in enough places), and we'll always need to maintain the pipe code for other OSes anyway. So I'm -0.5 on doing that, even though the technical puzzle is appealing... [1] https://www.freebsd.org/cgi/man.cgi?query=procctl&apropos=0&sektion=2&manpath=FreeBSD+11.2-RELEASE&arch=default&format=html Yeah, I don't think we can kill the pipe completely. As long as we still need it for the other OSes, I don't see much point in eliminating it on Linux and BSDs either. I'd rather keep the code similar across platforms. Looking at the patch: The 'postmaster_possibly_dead' flag is not reset anywhere. So if a process receives a spurious death signal, even though postmaster is still alive, PostmasterIsAlive() will continue to use the slow path. postmaster_possibly_dead needs to be marked as 'volatile', no? The autoconf check for PR_SET_PDEATHSIG seems slightly misplaced. And I think we can simplify it with AC_CHECK_HEADER(). I'd also like to avoid adding code to c.h for this, that seems too global. After some kibitzing, I ended up with the attached. It fixes the postmaster_possible_dead issues mentioned above, and moves around the autoconf and #ifdef logic a bit to make it a bit nicer, at least in my opinion. I don't have a FreeBSD machine at hand, so I didn't try fixing that patch. - Heikki >From 9418ed472d113a80bf0fbb209efb0835767a5c50 Mon Sep 17 00:00:00 2001 From: Heikki Linnakangas Date: Tue, 10 Jul 2018 14:31:48 +0300 Subject: [PATCH 1/1] Use signals for postmaster death on Linux. Linux provides a way to ask for a signal when your parent process dies. Use that to make PostmasterIsAlive() very cheap. 
Author: Thomas Munro, based on a suggestion from Andres Freund Reviewed-By: Michael Paquier Discussion: https://postgr.es/m/7261eb39-0369-f2f4-1bb5-62f3b6083b5e%40iki.fi Discussion: https://postgr.es/m/20180411002643.6buofht4ranhei7k%40alap3.anarazel.de --- configure | 2 +- configure.in | 2 +- src/backend/storage/ipc/latch.c| 6 +- src/backend/storage/ipc/pmsignal.c | 123 + src/backend/utils/init/miscinit.c | 4 ++ src/include/pg_config.h.in | 3 + src/include/pg_config.h.win32 | 3 + src/include/storage/pmsignal.h | 33 +- 8 files changed, 157 insertions(+), 19 deletions(-) diff --git a/configure b/configure index 1bc465b942..41e0e1cf34 100755 --- a/configure +++ b/configure @@ -12494,7 +12494,7 @@ $as_echo "#define HAVE_STDBOOL_H 1" >>confdefs.h fi -for ac_header in atomic.h crypt.h dld.h fp_class.h getopt.h ieeefp.h ifaddrs.h langinfo.h mbarrier.h poll.h sys/epoll.h sys/ipc.h sys/pstat.h sys/resource.h sys/select.h sys/sem.h sys/shm.h sys/sockio.h sys/tas.h sys/un.h termios.h ucred.h utime.h wchar.h wctype.h +for ac_header in atomic.h crypt.h dld.h fp_class.h getopt.h ieeefp.h ifaddrs.h langinfo.h mbarrier.h poll.h sys/epoll.h sys/ipc.h sys/prctl.h sys/pstat.h sys/resource.h sys/select.h sys/sem.h sys/shm.h sys/sockio.h sys/tas.h sys/un.h termios.h ucred.h utime.h wchar.h wctype.h do : as_ac_Header=`$as_echo "ac_cv_header_$ac_header" | $as_tr_sh` ac_fn_c_check_header_mongrel "$LINENO" "$ac_header" "$as_ac_Header" "$ac_includes_default" diff --git a/configure.in b/configure.in index a6b3b88cfa..1e76c9ee46 100644 --- a/configure.in +++ b/configure.in @@ -1260,7 +1260,7 @@ AC_SUBST(UUID_LIBS) AC_HEADER_STDBOOL -AC_CHECK_HEADERS([atomic.h crypt.h dld.h fp_class.h getopt.h ieeefp.h ifaddrs.h langinfo.h mbarrier.h poll.h sys/epoll.h sys/ipc.h sys/pstat.h sys/resource.h sys/select.h sys/sem.h sys/shm.h sys/sockio.h sys/tas.h sys/un.h termios.h ucred.h utime.h wchar.h wctype.h]) +AC_CHECK_HEADERS([atomic.h crypt.h dld.h fp_class.h getopt.h ieeefp.h ifaddrs.h langinfo.h mbarrier.h poll.h sys/epoll.h sys/ipc.h sys/prctl.h sys/pstat.h sys/resource.h sys/select.h sys/sem.h sys/shm.h sys/sockio.h sys/tas.h
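On the two review points above (postmaster_possibly_dead never being reset, and the flag needing to be volatile), a rough sketch of the intended fast path might look like this; PostmasterIsAliveInternal() is a made-up name for the existing slow pipe check, not something taken from the patch:

/*
 * Rough sketch of the fast path under discussion, not the actual patch.
 * The death signal handler only sets the flag; the expensive check runs
 * only when the flag is set, and a spurious signal resets it.
 */
static volatile sig_atomic_t postmaster_possibly_dead = false;

bool
PostmasterIsAlive(void)
{
	if (!postmaster_possibly_dead)
		return true;					/* cheap common case */

	if (PostmasterIsAliveInternal())	/* hypothetical: the old pipe check */
	{
		postmaster_possibly_dead = false;	/* false alarm, reset the flag */
		return true;
	}
	return false;
}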
Re: shared-memory based stats collector
Hello. Thanks for the opinions. At Fri, 6 Jul 2018 13:10:36 -0700, Andres Freund wrote in <20180706201036.awheoi6tk556x...@alap3.anarazel.de> > Hi, > > On 2018-07-06 22:03:12 +0200, Magnus Hagander wrote: > > *If* we can provide the snapshots view of them without too much overhead I > > think it's worth looking into that while *also* proviiding a lower overhead > > interface for those that don't care about it. > > I don't see how that's possible without adding significant amounts of > complexity and probably memory / cpu overhead. The current stats already > are quite inconsistent (often outdated, partially updated, messages > dropped when busy) - I don't see what we really gain by building > something MVCC like in the "new" stats subsystem. > > > > If it ends up that keeping the snapshots become too much overhead in either > > in performance or code-maintenance, then I agree can probably drop that. > > But we should at least properly investigate the cost. > > I don't think it's worthwhile to more than think a bit about it. There's > fairly obvious tradeoffs in complexity here. Trying to get there seems > like a good way to make the feature too big. Agreed. Well, if we allow to lose consistency in some extent for improved performance and smaller footprint, relaxing the consistency of database stats can reduce footprint further especially on a cluster with so many databases. Backends are interested only in the residing database and vacuum doesn't cache stats at all. A possible problem is vacuum and stats collector can go into a race condition. I'm not sure but I suppose it is not worse than being involved in an IO congestion. regards. -- Kyotaro Horiguchi NTT Open Source Software Center >From 889237d7381d334df3a35e1fa5350298352fb9fe Mon Sep 17 00:00:00 2001 From: Kyotaro Horiguchi Date: Fri, 29 Jun 2018 16:41:04 +0900 Subject: [PATCH 1/6] sequential scan for dshash Add sequential scan feature to dshash. --- src/backend/lib/dshash.c | 138 +++ src/include/lib/dshash.h | 23 +++- 2 files changed, 160 insertions(+), 1 deletion(-) diff --git a/src/backend/lib/dshash.c b/src/backend/lib/dshash.c index b46f7c4cfd..5b133226ac 100644 --- a/src/backend/lib/dshash.c +++ b/src/backend/lib/dshash.c @@ -592,6 +592,144 @@ dshash_memhash(const void *v, size_t size, void *arg) return tag_hash(v, size); } +/* + * dshash_seq_init/_next/_term + * Sequentially scan trhough dshash table and return all the + * elements one by one, return NULL when no more. + * + * dshash_seq_term should be called if and only if the scan is abandoned + * before completion; if dshash_seq_next returns NULL then it has already done + * the end-of-scan cleanup. + * + * On returning element, it is locked as is the case with dshash_find. + * However, the caller must not release the lock. The lock is released as + * necessary in continued scan. + * + * As opposed to the equivalent for dynanash, the caller is not supposed to + * delete the returned element before continuing the scan. + * + * If consistent is set for dshash_seq_init, the whole hash table is + * non-exclusively locked. Otherwise a part of the hash table is locked in the + * same mode (partition lock). 
+ */ +void +dshash_seq_init(dshash_seq_status *status, dshash_table *hash_table, +bool consistent) +{ + status->hash_table = hash_table; + status->curbucket = 0; + status->nbuckets = ((size_t) 1) << hash_table->control->size_log2; + status->curitem = NULL; + status->curpartition = -1; + status->consistent = consistent; + + /* + * Protect all partitions from modification if the caller wants a + * consistent result. + */ + if (consistent) + { + int i; + + for (i = 0; i < DSHASH_NUM_PARTITIONS; ++i) + { + Assert(!LWLockHeldByMe(PARTITION_LOCK(hash_table, i))); + + LWLockAcquire(PARTITION_LOCK(hash_table, i), LW_SHARED); + } + } + ensure_valid_bucket_pointers(hash_table); +} + +void * +dshash_seq_next(dshash_seq_status *status) +{ + dsa_pointer next_item_pointer; + + if (status->curitem == NULL) + { + Assert (status->curbucket == 0); + Assert(!status->hash_table->find_locked); + + /* first shot. grab the first item. */ + next_item_pointer = status->hash_table->buckets[status->curbucket]; + status->hash_table->find_locked = true; + } + else + next_item_pointer = status->curitem->next; + + /* Move to the next bucket if we finished the current bucket */ + while (!DsaPointerIsValid(next_item_pointer)) + { + if (++status->curbucket >= status->nbuckets) + { + /* all buckets have been scanned. finsih. */ + dshash_seq_term(status); + return NULL; + } + Assert(status->hash_table->find_locked); + + next_item_pointer = status->hash_table->buckets[status->curbucket]; + + /* + * we need a lock on the scanning partition even if the caller don't + * requested a consistent snapshot. + */ + if (!status->consistent && DsaPointerIsValid(next_item_pointer)) + { + dshash_table_item *i
Re: EXPLAIN of Parallel Append
On Mon, Jul 9, 2018 at 7:00 PM, Jesper Pedersen wrote:
> On 07/07/2018 01:08 AM, Amit Kapila wrote:
>> On Wed, Mar 14, 2018 at 8:58 PM, Jesper Pedersen wrote:
>>>
>>> Parallel Append's ntuples is 1, but given nloops is 3 you end up with the
>>> slightly confusing "(actual ... rows=0 loops=3)".
>>>
>> The number of rows displayed is total_rows / loops due to which you
>> are seeing these numbers. This behavior is the same for all parallel
>> nodes, nothing specific to Parallel Append.
>
> Thanks !
>
> Maybe something like the attached patch for the documentation is needed.

> -performance characteristics of the plan.
> +performance characteristics of the plan. Note, that the parallel nodes
> +may report zero rows returned due internal calculations when one or more
> +rows are actually being returned.

typo. /due/due to

I think it is quite unclear what you mean by internal calculations. It would be great if you can come up with something similar to how we have already explained this for nested loop joins [1]; you can add what you have written at the end of that paragraph, after explaining the actual calculation (in your example, ntuples = 1 averaged over loops = 3 is displayed as rows=0). This has been a common source of confusion ever since parallel query was developed, so if you can write some nice example and text, it would be really helpful.

[1] - https://www.postgresql.org/docs/devel/static/using-explain.html#USING-EXPLAIN-ANALYZE

Refer to this text on that link:
"In some query plans, it is possible for a subplan node to be executed more than once. For example, the inner index scan will be executed once per outer row in the above nested-loop plan. In such cases, the loops value reports the total number of executions of the node, and the actual time and rows values shown are averages per-execution. This is done to make the numbers comparable with the way that the cost estimates are shown. Multiply by the loops value to get the total time actually spent in the node."

--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com
Re: shared-memory based stats collector
On 07/10/2018 02:07 PM, Kyotaro HORIGUCHI wrote:
> Hello. Thanks for the opinions.
>
> At Fri, 6 Jul 2018 13:10:36 -0700, Andres Freund wrote in <20180706201036.awheoi6tk556x...@alap3.anarazel.de>
>> Hi,
>>
>> On 2018-07-06 22:03:12 +0200, Magnus Hagander wrote:
>>> *If* we can provide the snapshots view of them without too much overhead I
>>> think it's worth looking into that while *also* proviiding a lower overhead
>>> interface for those that don't care about it.
>>
>> I don't see how that's possible without adding significant amounts of
>> complexity and probably memory / cpu overhead. The current stats already
>> are quite inconsistent (often outdated, partially updated, messages
>> dropped when busy) - I don't see what we really gain by building
>> something MVCC like in the "new" stats subsystem.
>>
>>> If it ends up that keeping the snapshots become too much overhead in either
>>> in performance or code-maintenance, then I agree can probably drop that.
>>> But we should at least properly investigate the cost.
>>
>> I don't think it's worthwhile to more than think a bit about it. There's
>> fairly obvious tradeoffs in complexity here. Trying to get there seems
>> like a good way to make the feature too big.
>
> Agreed.
>
> Well, if we allow to lose consistency in some extent for improved
> performance and smaller footprint, relaxing the consistency of database
> stats can reduce footprint further especially on a cluster with so many
> databases. Backends are interested only in the residing database and
> vacuum doesn't cache stats at all. A possible problem is vacuum and stats
> collector can go into a race condition. I'm not sure but I suppose it is
> not worse than being involved in an IO congestion.

As someone who regularly analyzes stats collected from user systems, I think there's certainly some value with keeping the snapshots reasonably consistent. But I agree it doesn't need to be perfect, and some level of inconsistency is acceptable (and the amount of complexity/overhead needed to maintain perfect consistency seems rather excessive here).

There's one more reason why attempts to keep stats snapshots "perfectly" consistent are likely doomed to fail - the messages are sent over UDP, which does not guarantee delivery etc. So there's always some level of possible inconsistency even with "perfectly consistent" snapshots.

regards

--
Tomas Vondra  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
[PG-11] Potential bug related to INCLUDE clause of CREATE INDEX
Hi Dave, I am working on a feature to support the INCLUDE clause of indexes in PG-11. As per the documentation https://www.postgresql.org/docs/11/static/sql-createindex.html, columns listed in the INCLUDE clause cannot also be present as index key columns. But I find different behaviour for the below queries, which are logically identical. CREATE TABLE some_table ( id serial primary key, first_name character varying(45), last_name character varying ) CREATE INDEX ind1 ON public.some_table USING btree (id) INCLUDE(id) TABLESPACE pg_default; This query fails with the error ERROR: included columns must not intersect with key columns CREATE INDEX ind1 ON public.some_table USING btree (id asc nulls last) INCLUDE(id) TABLESPACE pg_default; This query passes and the index is created. Kindly let me know if I am missing anything. -- Thanks and Regards, Aditya Toshniwal Software Engineer | EnterpriseDB Software Solutions | Pune "Don't Complain about Heat, Plant a tree"
Re: [PG-11] Potential bug related to INCLUDE clause of CREATE INDEX
Hi Team, Please ignore the name after "Hi" in the previous mail. :/ The potential bug is as mentioned in the mail below. On Tue, Jul 10, 2018 at 6:37 PM, Aditya Toshniwal < aditya.toshni...@enterprisedb.com> wrote: > Hi Dave, > > I am working on a feature to support INCLUDE clause of index in PG-11. As > per the documentation https://www.postgresql.org/docs/11/static/sql- > createindex.html, columns listed in INCLUDE clause cannot also be present > as index key columns. But I find different behaviour for below queries > which are logically identical. > > CREATE TABLE some_table > ( > id serial primary key, > first_name character varying(45), > last_name character varying > ) > > CREATE INDEX ind1 > ON public.some_table USING btree > (id) > INCLUDE(id) > TABLESPACE pg_default; > > This query fails with error > ERROR: included columns must not intersect with key columns > > CREATE INDEX ind1 > ON public.some_table USING btree > (id asc nulls last) > INCLUDE(id) > TABLESPACE pg_default; > > This query passes and index is created. > > Kindly let me know if I am missing anything. > > -- > Thanks and Regards, > Aditya Toshniwal > Software Engineer | EnterpriseDB Software Solutions | Pune > "Don't Complain about Heat, Plant a tree" > -- Thanks and Regards, Aditya Toshniwal Software Engineer | EnterpriseDB Software Solutions | Pune "Don't Complain about Heat, Plant a tree"
Re: Test for trigger condition accessing system attributes
On 10/05/18 11:58, Ashutosh Bapat wrote: Hi, I was investigating the cases when the system attributes are accessed beyond the scans. After investigating set_plan_references(), I thought that we never access system attributes beyond scans. This lead me to assume that EEOP_INNER/OUTER_SYSVAR are not needed since we do not access system attributes from an inner or outer slot. I removed the defintions and code using those and ran regression. All the tests passed. So, I was about to conclude that my assumption is correct. But then looking at TriggerEnabled() I realised that we also (ab?)use INNER/OUTER Vars for OLD/NEW tuples for trigger condition. If the WHEN condition in CREATE TRIGGER command refers to a system attribute, we will end up having INNER/OUTER var refering a system attribute, thus exercising code for EEOP_INNER/OUTER_SYSVAR. Here's patch containing a testcase exercizing that code using EEOP_INNER/OUTER_SYSVAR. Pushed with some small changes. Thanks! - Heikki
Re: [PATCH] Add support for ON UPDATE/DELETE actions on ALTER CONSTRAINT
On Wed, Mar 7, 2018 at 11:49 PM, Matheus de Oliveira wrote: > > > On 3 Mar 2018 19:32, "Peter Eisentraut" > wrote: > > On 2/20/18 10:10, Matheus de Oliveira wrote: >> Besides that, there is a another change in this patch on current ALTER >> CONSTRAINT about deferrability options. Previously, if the user did >> ALTER CONSTRAINT without specifying an option on deferrable or >> initdeferred, it was implied the default options, so this: >> >> ALTER TABLE tbl >> ALTER CONSTRAINT con_name; >> >> Was equivalent to: >> >> ALTER TABLE tbl >> ALTER CONSTRAINT con_name NOT DEFERRABLE INITIALLY IMMEDIATE; > > Oh, that seems wrong. Probably, it shouldn't even accept that syntax > with an empty options list, let alone reset options that are not > mentioned. > > > Yeah, it felt really weird when I noticed it. And I just noticed while > reading the source. > > Can > > you prepare a separate patch for this issue? > > > I can do that, no problem. It'll take awhile though, I'm on a trip and will > be home around March 20th. Matheus, When do you think you can provide the patch for the bug fix? Also, the patch you originally posted doesn't apply cleanly. Can you please post a rebased version? The patch contains 70 odd lines of test SQL and 3600 odd lines of output. The total patch is 4200 odd lines. I don't think it will be acceptable to add such a huge output to the regression test. You will need to provide a patch with a much smaller output addition, and maybe a smaller test as well. > > You think this should be applied to all versions that support ALTER > CONSTRAINT, right? I think so. -- Best Wishes, Ashutosh Bapat EnterpriseDB Corporation The Postgres Database Company
Re: Test for trigger condition accessing system attributes
Thanks Heikki. On Tue, Jul 10, 2018 at 6:47 PM, Heikki Linnakangas wrote: > On 10/05/18 11:58, Ashutosh Bapat wrote: >> >> Hi, >> I was investigating the cases when the system attributes are accessed >> beyond the scans. After investigating set_plan_references(), I thought >> that we never access system attributes beyond scans. This lead me to >> assume that EEOP_INNER/OUTER_SYSVAR are not needed since we do not >> access system attributes from an inner or outer slot. I removed the >> defintions and code using those and ran regression. All the tests >> passed. So, I was about to conclude that my assumption is correct. But >> then looking at TriggerEnabled() I realised that we also (ab?)use >> INNER/OUTER Vars for OLD/NEW tuples for trigger condition. If the WHEN >> condition in CREATE TRIGGER command refers to a system attribute, we >> will end up having INNER/OUTER var refering a system attribute, thus >> exercising code for EEOP_INNER/OUTER_SYSVAR. >> >> Here's patch containing a testcase exercizing that code using >> EEOP_INNER/OUTER_SYSVAR. > > > Pushed with some small changes. Thanks! > > - Heikki -- Best Wishes, Ashutosh Bapat EnterpriseDB Corporation The Postgres Database Company
Re: Retrieve memory size allocated by libpq
Attached is an updated patch of PQresultMemsize(). The implementation is unchanged, but I added SGML documentation about this new function. I'd be pleased about comments to adding this to libpq. -- Kind Regards, Lars From 766a7f2381ae6c9d442a7359cabc58515186f8c4 Mon Sep 17 00:00:00 2001 From: Lars Kanis Date: Sat, 23 Jun 2018 19:34:11 +0200 Subject: [PATCH] libpq: Add function PQresultMemsize() This function retrieves the number of bytes allocated for a given result. That can be used to instruct the GC about the memory consumed behind a wrapping object and for diagnosing memory consumption. This is an alternative approach to customizable malloc/realloc/free functions as discussed here: https://www.postgresql.org/message-id/flat/20170828172834.GA71455%40TC.local#20170828172834.GA71455@TC.local --- doc/src/sgml/libpq.sgml | 28 src/interfaces/libpq/exports.txt | 1 + src/interfaces/libpq/fe-exec.c | 14 +- src/interfaces/libpq/libpq-fe.h | 1 + src/interfaces/libpq/libpq-int.h | 2 ++ 5 files changed, 45 insertions(+), 1 deletion(-) diff --git a/doc/src/sgml/libpq.sgml b/doc/src/sgml/libpq.sgml index d67212b831..c573c79ae3 100644 --- a/doc/src/sgml/libpq.sgml +++ b/doc/src/sgml/libpq.sgml @@ -6384,6 +6384,34 @@ void *PQresultAlloc(PGresult *res, size_t nBytes); + + + PQresultMemsize + + PQresultMemsize + + + + + + Retrieves the number of bytes allocated for a PGresult object. + +size_t PQresultMemsize(const PGresult *res); + + + + + The number of bytes includes the memory allocated for the PGresult itself, + memory to store data from the server, required internal metadata of a + PGresult object and data allocated by PQresultAlloc. + That is to say all memory which gets freed by PQclear. + + This information can be used for diagnosing memory consumption and to instruct + a garbage collector about the memory consumed behind a wrapping object. 
+ + + + PQlibVersion diff --git a/src/interfaces/libpq/exports.txt b/src/interfaces/libpq/exports.txt index d6a38d0df8..0b50dddbb7 100644 --- a/src/interfaces/libpq/exports.txt +++ b/src/interfaces/libpq/exports.txt @@ -172,3 +172,4 @@ PQsslAttribute169 PQsetErrorContextVisibility 170 PQresultVerboseErrorMessage 171 PQencryptPasswordConn 172 +PQresultMemsize 173 diff --git a/src/interfaces/libpq/fe-exec.c b/src/interfaces/libpq/fe-exec.c index 4c0114c514..064c7a693c 100644 --- a/src/interfaces/libpq/fe-exec.c +++ b/src/interfaces/libpq/fe-exec.c @@ -166,6 +166,7 @@ PQmakeEmptyPGresult(PGconn *conn, ExecStatusType status) result->curBlock = NULL; result->curOffset = 0; result->spaceLeft = 0; + result->memsize = sizeof(PGresult); if (conn) { @@ -215,6 +216,12 @@ PQmakeEmptyPGresult(PGconn *conn, ExecStatusType status) return result; } +size_t +PQresultMemsize(const PGresult *res) +{ + return res->memsize; +} + /* * PQsetResultAttrs * @@ -567,9 +574,11 @@ pqResultAlloc(PGresult *res, size_t nBytes, bool isBinary) */ if (nBytes >= PGRESULT_SEP_ALLOC_THRESHOLD) { - block = (PGresult_data *) malloc(nBytes + PGRESULT_BLOCK_OVERHEAD); + size_t alloc_size = nBytes + PGRESULT_BLOCK_OVERHEAD; + block = (PGresult_data *) malloc(alloc_size); if (!block) return NULL; + res->memsize += alloc_size; space = block->space + PGRESULT_BLOCK_OVERHEAD; if (res->curBlock) { @@ -594,6 +603,7 @@ pqResultAlloc(PGresult *res, size_t nBytes, bool isBinary) block = (PGresult_data *) malloc(PGRESULT_DATA_BLOCKSIZE); if (!block) return NULL; + res->memsize += PGRESULT_DATA_BLOCKSIZE; block->next = res->curBlock; res->curBlock = block; if (isBinary) @@ -711,6 +721,7 @@ PQclear(PGresult *res) res->errFields = NULL; res->events = NULL; res->nEvents = 0; + res->memsize = 0; /* res->curBlock was zeroed out earlier */ /* Free the PGresult structure itself */ @@ -927,6 +938,7 @@ pqAddTuple(PGresult *res, PGresAttValue *tup, const char **errmsgp) realloc(res->tuples, newSize * sizeof(PGresAttValue *)); if (!newTuples) return false; /* malloc or realloc failed */ + res->memsize += (newSize - res->tupArrSize) * sizeof(PGresAttValue *); res->tupArrSize = newSize; res->tuples = newTuples; } diff --git a/src/interfaces/libpq/libpq-fe.h b/src/interfaces/libpq/libpq-fe.h index ed9c806861..4fd9a4fcda 100644 --- a/src/interfaces/libpq/libpq-fe.h +++ b/src/interfaces/libpq/libpq-fe.h @@ -491,6 +491,7 @@ extern int PQgetlength(const PGresult *res, int tup_num, int field_num); extern int PQgetisnull(const PGresult *res, int tup_num, int field_num); extern int PQnparams(const PGresult *res); extern Oid PQparamtype(const PGresult *res, int param_num); +extern size_t PQresultMemsize(const PGresult *res); /* Describe prepared statements and portals */ extern PGresult *PQdescribePrepared(PGconn *conn, const cha
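To illustrate the intended use, here is a minimal hypothetical caller (not part of the patch; the helper name and usage are made up) showing how an application or driver wrapper could read the figure before reporting it to a garbage collector or a monitoring hook:

#include <stdio.h>
#include <libpq-fe.h>

/* Hypothetical example: report the memory held by a result set. */
static void
report_result_size(PGconn *conn, const char *query)
{
	PGresult   *res = PQexec(conn, query);

	if (PQresultStatus(res) == PGRES_TUPLES_OK)
		printf("result holds %zu bytes for %d rows\n",
			   PQresultMemsize(res), PQntuples(res));

	/* PQclear() releases everything counted by PQresultMemsize(). */
	PQclear(res);
}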
Re: Locking B-tree leafs immediately in exclusive mode
On 2018/07/10 20:36, Alexander Korotkov wrote: > On Tue, Jul 10, 2018 at 2:19 PM Imai, Yoshikazu > wrote: >> On Mon, July 9, 2018 at 4:44 PM, Alexander Korotkov wrote: Firstly, I did performance tests on 72-cores machines(AWS c5.18xlarge) same as you did. >>> >>> OK. But not that c5.18xlarge is 72-VCPU machine, which AFAIK is >>> close to the performance of 36 physical cores. >> >> Thanks for pointing that. I referred to /proc/cpuinfo and understood there >> are 36 physical cores. >> >>> In this case it also looks like we observed 1% regression. Despite 1% >>> may seem to be very small, I think we should clarify whether it really >>> exists. I have at least two hypothesis about this. >>> >>> 1) There is no real regression, observed difference of TPS is less >>> than error of measurements. In order to check that we need to retry >>> the experiment multiple times. Also, if you run benchmark on master >>> before patched version (or vice versa) you should also try to swap the >>> order to make sure there is no influence of the order of benchmarks. >>> 2) If we consider relation between TPS and number of clients, TPS is >>> typically growing with increasing number of clients until reach some >>> saturation value. After the saturation value, there is some >>> degradation of TPS. If patch makes some latency lower, that my cause >>> saturation to happen earlier. In order to check that, we need run >>> benchmarks with various number of clients and draw a graph: TPS >>> depending on clients. >>> >>> So, may I ask you to make more experiments in order to clarify the >>> observed regression? >> >> I experimented 2) with changing clients parameter with 18, 36, 54, 72. >> While doing experiment, I realized that results of pgbench with 36 clients >> improve after executing pgbench with 72 clients. >> I don't know why this occurs, but anyway, in this experiment, I executed >> pgbench with 72 clients before executing other pgbenchs. (e.g. -c 72, -c 18, >> -c 36, -c 54, -c 72) >> I tested experiments to master and patched unorderly(e.g. master, patched, >> patched, master, master, patched, patched, master) >> >> # results of changing clients(18, 36, 54, 72 clients) >> master, -c 18 -j 18: Ave 400410 TPS (407615,393942,401845,398241) >> master, -c 36 -j 36: Ave 415616 TPS (411939,400742,424855,424926) >> master, -c 54 -j 54: Ave 378734 TPS (401646,354084,408044,351163) >> master, -c 72 -j 72: Ave 360864 TPS (367718,360029,366432,349277) >> patched, -c 18 -j 18: Ave 393115 TPS (382854,396396,395530,397678) >> patched, -c 36 -j 36: Ave 390328 TPS (376100,404873,383498,396840) >> patched, -c 54 -j 54: Ave 364894 TPS (365533,373064,354250,366727) >> patched, -c 72 -j 72: Ave 353982 TPS (355237,357601,345536,357553) >> >> It may seem saturation is between 18 and 36 clients, so I also experimented >> with 27 clients. >> >> # results of changing clients(27 clients) >> master, -c 27 -j 27: Ave 416756 (423512,424241,399241,420030) TPS >> patched, -c 27 -j 27: Ave 413568 (410187,404291,420152,419640) TPS >> >> I created a graph and attached in this mail("detecting saturation.png"). >> Referring to a graph, patched version's saturation happens earlier than >> master's one as you expected. >> But even the patched version's nearly saturated TPS value has small >> regression from the master's one, I think. >> >> Is there another experiments to do about this? > > Thank you for the experiments! It seems that there is real regression > here... 
BTW, which script were you using in this benchmark: > script_unordered.sql or script_duplicated.sql? Sorry, I forgot to write that. I used script_unordered.sql. Yoshikazu Imai
Re: How can we submit code patches that implement our (pending) patents?
"Tsunakawa, Takayuki" writes: > From: Nico Williams [mailto:n...@cryptonector.com] >> ... But that might reduce the >> size of the community, or lead to a fork. > Yes, that's one unfortunate future, which I don't want to happen of > course. I believe PostgreSQL should accept patent for further > evolution, because PostgreSQL is now a popular, influential software > that many organizations want to join. The core team has considered this matter, and has concluded that it's time to establish a firm project policy that we will not accept any code that is known to be patent-encumbered. The long-term legal risks and complications involved in doing that seem insurmountable, given the community's amorphous legal nature and the existing Postgres license wording (neither of which are open for negotiation here). Furthermore, Postgres has always been very friendly to creation of closed-source derivatives, but it's hard to see how inclusion of patented code would not cause serious problems for those. The potential benefits of accepting patented code just don't seem to justify trying to navigate these hazards. regards, tom lane
Re: BUG #15212: Default values in partition tables don't work as expected and allow NOT NULL violation
I didn't see any hackers thread linked to this CF entry. Hence I am sending this mail through the CF app. The patch looks good to me. It applies cleanly, compiles cleanly and make check passes. I think you could avoid the condition + /* Do not override parent's NOT NULL constraint. */ + if (restdef->is_not_null) + coldef->is_not_null = restdef->is_not_null; by rewriting this line as coldef->is_not_null = coldef->is_not_null || restdef->is_not_null; The comment looks a bit odd either way since we are changing coldef->is_not_null based on restdef->is_not_null. That's because of the nature of boolean variables. Maybe a bit of explanation is needed. On a side note, I see coldef->constraints = restdef->constraints; Shouldn't we merge the constraints instead of just overwriting those? -- Best Wishes, Ashutosh
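For what it's worth, a hypothetical variant of that hunk — assuming this is the MergeAttributes() code path — that merges rather than overwrites, combining both suggestions above:

/* Hypothetical rewrite, not the proposed patch: combine the parent's and
 * the partition's settings instead of conditionally overwriting them. */
coldef->is_not_null = coldef->is_not_null || restdef->is_not_null;
coldef->constraints = list_concat(coldef->constraints,
								  restdef->constraints);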
Re: Non-reserved replication slots and slot advancing
On 2018-Jul-10, Michael Paquier wrote: > Yep, let's change that as well. If you want to look at that stuff more > deeply, please feel free. Otherwise I could always push what I have > now. I say please push already. We can push more fixes later if they are needed. -- Álvaro Herrerahttps://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Re: [HACKERS] PoC: full merge join on comparison clause
On Tue, Jul 10, 2018 at 12:05 AM, Alexander Kuzmenkov wrote: > On 07/09/2018 04:12 PM, Ashutosh Bapat wrote: >> >> On Fri, Jul 6, 2018 at 6:31 PM, Ashutosh Bapat >> wrote: >>> >>> I will continue reviewing the patches. >>> >> Here are some more review comments > > > > Ashutosh, > > Many thanks for the review, I'm glad that we are continuing with this patch. > I'm working on your comments now, will post the updated version this week. While updating the patches, please consider adding some comments as to why only a single inequality clause is supported. I didn't see comments in the patch explaining that. -- Best Wishes, Ashutosh Bapat EnterpriseDB Corporation The Postgres Database Company
Re: [PG-11] Potential bug related to INCLUDE clause of CREATE INDEX
Aditya Toshniwal writes: > I am working on a feature to support INCLUDE clause of index in PG-11. As > per the documentation https://www.postgresql.org/docs/11/static/ > sql-createindex.html, columns listed in INCLUDE clause cannot also be > present as index key columns. But I find different behaviour for below > queries which are logically identical. I wonder why there is any such restriction at all. We have never attempted to prevent the creation of "silly" indexes, eg regression=# create table some_table (id int); CREATE TABLE regression=# create index on some_table (id,id); CREATE INDEX regression=# \d+ some_table Table "public.some_table" Column | Type | Collation | Nullable | Default | Storage | Stats target | Description +-+---+--+-+-+--+- id | integer | | | | plain | | Indexes: "some_table_id_id1_idx" btree (id, id) So my inclination is to rip out the "must not intersect" test altogether, not try to make it a bit smarter. regards, tom lane
Re: [HACKERS] WAL logging problem in 9.4.3?
Thanks for picking this up! (I hope this gets through the email filters this time, sending a shell script seems to be difficult. I also trimmed the CC list, if that helps.) On 04/07/18 07:59, Michael Paquier wrote: Hence I propose the patch attached which disables the TRUNCATE and COPY optimizations for two cases, which are the ones actually causing problems. One solution has been presented by Simon here for COPY, which is to disable the optimization when there are no blocks on a relation with wal_level = minimal: https://www.postgresql.org/message-id/CANP8+jKN4V4MJEzFN_iEtdZ+1oM=yetxvmuu1yk4umxqy2g...@mail.gmail.com For back-patching, I find that really appealing. This fails in the case that there are any WAL-logged changes to the table while the COPY is running. That can happen at least if the table has an INSERT trigger, that performs operations on the same table, and the COPY fires the trigger. That scenario is covered by the little bash script I posted earlier in this thread (https://www.postgresql.org/message-id/55AFC302.1060805%40iki.fi). Attached is a new version of that script, updated to make it work with v11. The second thing that the patch attached does is to tweak ExecuteTruncateGuts so as the TRUNCATE optimization never runs for wal_level = minimal. If we go down that route, let's at least keep the TRUNCATE optimization for temporary and unlogged tables. - Heikki #!/bin/bash set -e export PGDATABASE=postgres PATH=./bin:$PATH cat > /tmp/test-copydata <> data-minimal/postgresql.conf echo "wal_level=minimal" >> data-minimal/postgresql.conf # CREATE, INSERT, COPY, crash. # # If COPY inserts to the existing block, and is not WAL-logged, replaying # the implicit FPW of the INSERT record will destroy the COPY data. pg_ctl -D data-minimal -w start psql <= 16384; INSERT INTO test1 VALUES ('inserted row'); \copy test1 FROM '/tmp/test-copydata' COMMIT; EOF pg_ctl -D data-minimal stop -m immediate sleep 1 pg_ctl -D data-minimal -w start echo "Should have 4 rows:" psql -c "SELECT * FROM test1" psql -c "DROP TABLE test1" > /dev/null # cleanup # CREATE, COPY, crash. Trigger in COPY that inserts more to same table. # # If the INSERTS from the trigger go to the same block we're copying to, # and the INSERTs are WAL-logged, WAL replay will fail when it tries to # replay the WAL record but the "before" image doesn't match, because not # all changes were WAL-logged. #pg_ctl -D data-minimal -w start psql <= 16384; \copy test1 FROM '/tmp/test-copydata' COMMIT; EOF pg_ctl -D data-minimal stop -m immediate sleep 1 pg_ctl -D data-minimal -w start echo "Should have 6 rows (3 original and 3 inserted by trigger):" psql -c "SELECT * FROM test1" psql -c "DROP TABLE test1" > /dev/null # cleanup psql -c "DROP FUNCTION test1_beforetrig()" > /dev/null # cleanup # CREATE, TRUNCATE, COPY, crash. # # If we skip WAL-logging of the COPY, replaying the TRUNCATE record destroy # the newly inserted data. #pg_ctl -D data-minimal -w start psql <= 16384; TRUNCATE test1; SELECT relname, relfilenode from pg_class where relfilenode >= 16384; \copy test1 FROM '/tmp/test-copydata' COMMIT; EOF pg_ctl -D data-minimal stop -m immediate sleep 1 pg_ctl -D data-minimal -w start echo "Should have 3 rows:" psql -c "SELECT * FROM test1"
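A rough sketch of the kind of test being suggested for ExecuteTruncateGuts() — illustrative only, not the actual patch, and the helper name is invented:

#include "postgres.h"
#include "utils/rel.h"

/*
 * Illustrative sketch only: may the "assign a new relfilenode and skip
 * WAL" TRUNCATE optimization be used for this relation?
 */
static bool
truncate_can_skip_wal(Relation rel)
{
	/*
	 * Temporary and unlogged relations never WAL-log their contents, so
	 * the optimization is always safe for them.
	 */
	if (!RelationNeedsWAL(rel))
		return true;

	/*
	 * For permanent relations, be conservative and take the WAL-logged
	 * path, per the proposed fix for wal_level = minimal.
	 */
	return false;
}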
Re: [PG-11] Potential bug related to INCLUDE clause of CREATE INDEX
On Tue, Jul 10, 2018 at 6:37 PM, Aditya Toshniwal wrote: > Hi Dave, > > I am working on a feature to support INCLUDE clause of index in PG-11. As > per the documentation > https://www.postgresql.org/docs/11/static/sql-createindex.html, columns > listed in INCLUDE clause cannot also be present as index key columns. But I > find different behaviour for below queries which are logically identical. > > > CREATE INDEX ind1 > ON public.some_table USING btree > (id asc nulls last) > INCLUDE(id) > TABLESPACE pg_default; > > This query passes and index is created. > > Kindly let me know if I am missing anything. > Seems like a bug to me. I think the problem is that, while checking whether the INCLUDE column intersects with the index key or not, it compares the "IndexElem" of the INCLUDE column with the "IndexElem" of the index key. So if any field of the "IndexElem" differs, the columns are considered non-intersecting, and in this example the ordering options do not match. -- Regards, Dilip Kumar EnterpriseDB: http://www.enterprisedb.com
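To make that suspicion concrete, here is an illustrative sketch — not the actual DefineIndex()/parser code, and the helper is hypothetical — of why comparing column names rather than whole IndexElem nodes would catch both forms of the statement:

#include "postgres.h"
#include "nodes/parsenodes.h"

/*
 * Hypothetical check: do a key column and an INCLUDE column name the same
 * plain column?  Ordering options (ASC/DESC, NULLS FIRST/LAST) are
 * deliberately ignored, unlike a whole-node comparison.
 */
static bool
index_elems_intersect(IndexElem *key, IndexElem *included)
{
	/* A NULL name means an expression column; skip those here. */
	if (key->name == NULL || included->name == NULL)
		return false;

	return strcmp(key->name, included->name) == 0;
}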
Re: [HACKERS] PoC: full merge join on comparison clause
I tried to fix the things you mentioned and improve the comments. Among other changes, there is now a description of how merge join works with inequalities at the top of nodeMergejoin.c. It also explains why we only support one inequality clause. Some particular points: On 07/06/2018 04:01 PM, Ashutosh Bapat wrote: -StrategyNumber opstrategy = mergestrategies[iClause]; +StrategyNumber sort_strategy = mergestrategies[iClause]; -intop_strategy; +intjoin_strategy; I don't see a reason why should we change the name of variable here. These are operator strategies and there's no need to change their names. The name change is introducing unnecessary diffs. These variables have different meaning but their names differ only with an underscore. When I had to change this function, I made mistakes because of this. I'd keep the descriptive names to avoid further confusion. Should this be a separate patch? I think we should write a wrapper around MJCompare which interprets the result rather than changing MJCompare itself. OR at least change the name of MJCompare. Renamed the function to MJTestTuples to reflect that it decides whether we join tuples or advance either side. - * for get_mergejoin_opfamilies(). + * for get_equality_opfamilies(). I think we should leave mergejoin word in there or at least indicate that these are btree opfamilies so that we don't confuse it with hash equality operator families. Renamed these to get_btree_equality_opfamilies() and get_btree_comparison_opfamilies(). +if (parent->mj_Ineq_Present) +elog(ERROR, "inequality mergejoin clause must be the last one"); + IIUC, this never happens. If it really happens, we have created a path which can not be used practically. That should never happen. It will help to add a comment here clarifying that situation. This is just a cross-check for the planner. Added a comment. We should probably use a separate error code for internal errors as opposed to user errors, but I'm not sure if we have one, I see just elog(ERROR) being used everywhere. +boolhave_inequality = have_inequality_mergeclause(mergeclauses); There will be many paths created with different ordering of pathkeys. So, instead of calling have_inequality_mergeclause() for each of those paths, it's better to save this status in the path itself when creating the path. I removed this function altogether, because we can just check the last merge clause. When we cost the path, we already have a proper mergejoinable list of clauses, so if there is an inequality clause, it's the last one. /* Make a pathkey list with this guy first */ if (l != list_head(all_pathkeys)) +{ +if (have_inequality && l == list_tail(all_pathkeys)) +/* Inequality merge clause must be the last, we can't move it */ +break; + I am kind of baffled by this change. IIUC the way we create different orderings of pathkeys here, we are just rotating the pathkeys in circular order. This means there is exactly one ordering of pathkeys where the pathkey corresponding to the inequality clause is the last one. This code does not rotate the pathkeys circularly, but puts each of them in the first position, and keeps the rest in the original order. Say, if we have three equality pathkeys, and one inequality pathkey at the end (let's denote them as E1, E2, E3, IE), the permutations it tries will be like this: E1 E2 E3 IE E2 E1 E3 IE E3 E1 E2 IE Does this sound right? 
/* Might have no mergeclauses */ if (nClauses == 0) return NIL; +{ +List *ineq_clauses = find_inequality_clauses(mergeclauses); + +if (list_length(ineq_clauses) > 1) +return NIL; Without this patch, when there is an inequality clause with FULL JOIN, we will not create a merge join path because select_mergejoin_clauses() will set mergejoin_allowed to false. This means that we won't call sort_inner_and_outer(). I think this patch also has to do the same i.e. when there are more than one inequality clauses, select_mergejoin_clauses() should set mergejoin_allowed to false in case of a FULL JOIN since merge join machinary won't be able to handle that case. If we do that, we could arrange extra.mergeclause_list such that the inequality clause is always at the end thus finding inequality clause would be easy. I changed select_mergejoin_clauses() to filter multiple inequality clauses and disable join if needed. Now we can use extra inequalities as join filter, if it's not full join. I didn't reorder extra.mergeclause_list there, because this order is ignored later. select_outer_pathkeys_for_merge() chooses the order of pathkeys using some heuristics, and then find_mergeclauses_for_outer_pathkeys() reorders the clauses accordingly. -- Alexander Kuzmenkov Postgres Professional: http://www.postgr
Re: [PG-11] Potential bug related to INCLUDE clause of CREATE INDEX
Hi! > On 10 Jul 2018, at 17:54, Tom Lane wrote: > > Aditya Toshniwal writes: >> I am working on a feature to support INCLUDE clause of index in PG-11. As >> per the documentation https://www.postgresql.org/docs/11/static/ >> sql-createindex.html, columns listed in INCLUDE clause cannot also be >> present as index key columns. But I find different behaviour for below >> queries which are logically identical. > > I wonder why there is any such restriction at all. We have never > attempted to prevent the creation of "silly" indexes [...] So my inclination > is to rip out the "must not intersect" test altogether, > not try to make it a bit smarter It seems to me a valid way of reaching completely consistent validation behavior. But there are some other validation steps that seem useful: e.g. "ERROR: including column does not support ASC/DESC options" and "ERROR: including column does not support NULLS FIRST/LAST options". IMHO it is not a bug. CREATE INDEX ON some_table(id+0) INCLUDE (id); or some similar tricks will work anyway. Best regards, Andrey Borodin.
Re: Let's remove DSM_IMPL_NONE.
committed On 10.07.18 08:49, Kyotaro HORIGUCHI wrote: > Thank you for the notice. > > At Mon, 9 Jul 2018 12:30:22 +0300, Arthur Zakirov > wrote in <20180709093021.GA9309@zakirov.localdomain> >> Hello, >> >> On Mon, Jul 09, 2018 at 06:07:24PM +0900, Kyotaro HORIGUCHI wrote: >>> The new version v4 is changed in the following points. >>> >>> - Don't renumber the DSM_IMPL symbols, just removed _NONE. >>> >>> - Rewrote the pointed comment. >>> >>> - Revised the commit message removing a mention to an >>> already-committed patch. >>> >>> - (and rebased) >> >> Just a little note. In parallel.sgml it is still mentioned that >> dynamic_shared_memory_type accepts 'none' value: >> >>> must be set to a >>> value other than none. > > Oops! Thanks. Just removed it altogether and I didn't find > another instance. > > regards. > -- Peter Eisentraut http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Re: [HACKERS] [PATCH] kNN for SP-GiST
Hi! > I'm reviewing this patch. Currently I'm trying to understand sp-gist scan > deeeper, but as for now have some small notices. I've passed through the code one more time. Here are a few more questions: 1. The logic behind the division of the patch into steps was last described on 2017-01-30, but ISTM the actual steps have changed since then? May I ask you to write a bit about the steps of the patchset? 2. The patch leaves contribs intact. Do extensions with sp-gist opclasses need to update their behavior somehow to be used as-is? Or to support the new functionality? 3. There is a patch about predicate locking in SP-GiST [0]. Is this KNN patch theoretically compatible with predicate locking? Seems like it is, I just want to note that this functionality may exist. 4. The scan state now has scanStack and queue. Maybe it's better to name them scanStack and scanQueue, or stack and queue? Best regards, Andrey Borodin. [0] https://commitfest.postgresql.org/14/1215/
Re: [HACKERS] plpgsql - additional extra checks
2018-07-09 21:44 GMT+02:00 Alvaro Herrera : > > + ereport(errlevel, > > (errcode(ERRCODE_TOO_MANY_ > ROWS), > >errmsg("query returned > more than one row"), > > - errdetail ? > errdetail_internal("parameters: %s", errdetail) : 0)); > > + errdetail ? > errdetail_internal("parameters: %s", errdetail) : 0, > > + use_errhint ? > errhint("too_many_rows check of extra_%s is active.", > > + > too_many_rows_level == ERROR ? "errors" : "warnings") : 0)); > > Please write this in a way that results in less translatable messages, > and don't build setting names at runtime. Concretely I suggest this: > > errhint(too_many_rows_level == ERROR ? > gettext_noop("%s check of extra_errors is active.") : > gettext_noop("%s check of extra_warnings is active."), > "too_many_rows"); > > This way, > 1. only two messages need translation, not one per type of warning/error > 2. the translator doesn't need to scratch their head to figure out what >the second %s is > 3. they don't have to worry about introducing a typo in the string >"too_many_rows" or the strings "extra_errors", "extra_warnings". >(Laugh all you want. It's a real problem). > > If you can add a /* translator: */ comment to indicate what the first %s > is, all the better. I'm just not sure *where* it needs to go. I'm not > 100% sure the gettext_noop() is really needed either. > > > + ereport(strict_ > multiassignment_level, > > + > (errcode(ERRCODE_DATATYPE_MISMATCH), > > + > errmsg("Number of source and target fields in assignment do not match."), > > Please, no uppercase in errmsg(), and no ending period. > Thank you for notes. Tomorrow morning I'll spend few hours in train and I'll send updated patch Regards Pavel > -- > Álvaro Herrerahttps://www.2ndQuadrant.com/ > PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services >
Re: [PATCH] Timestamp for a XLOG_BACKUP_END WAL-record
On Tue, Jul 10, 2018 at 6:41 PM, Andrey V. Lepikhov wrote: > > > On 10.07.2018 06:45, Andres Freund wrote: >> >> Hi, >> >> On 2018-07-10 06:41:32 +0500, Andrey V. Lepikhov wrote: >>> >>> This functionality is needed in practice when we have to determine a >>> recovery time of specific backup. >> >> >> What do you mean by "recovery time of specific backup"? >> > > recovery time - is a time point where backup of PostgreSQL database instance > was made. > Performing database recovery, we want to know what point in time the > restored database will correspond to. > This functionality refers to improving the usability of pg_basebackup and > pg_probackup utilities. Why don't you use a backup history file for that purpose? Regards, -- Fujii Masao
_isnan() on Windows
The isnan() function is evidently not present in <math.h> on Windows before Visual Studio 2013. We define it in win32_port.h using _isnan(). However, _isnan() is also not present there; it is in <float.h>. The patch is attached to include <float.h> from win32_port.h. Thanks to Thomas Munro for pointing this out to me [1]. It is hard to notice this issue without testing the changes on Windows. [1] https://www.postgresql.org/message-id/CAEepm%3D3dE0w%3D_dOcELpPum6suze2NZwc%3DAV4T_xYrDUoHbZJvA%40mail.gmail.com win32_port_float_h_v00.patch Description: Binary data
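Since the patch itself is attached as a binary file, here is a sketch of what such a change is presumed to look like (the actual patch may differ in detail; _MSC_VER 1800 corresponds to Visual Studio 2013):

/* Presumed shape of the win32_port.h fix (illustrative only). */
#if defined(_MSC_VER) && _MSC_VER < 1800
#include <float.h>				/* declares _isnan() */
#define isnan(x) _isnan(x)
#endif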
Re: _isnan() on Windows
On 2018-Jul-10, Emre Hasegeli wrote: > isnan() function is evidently not present in <math.h> on Windows > before Visual Studio 2013. We define it in win32_port.h using > _isnan(). However _isnan() is also not present. It is in <float.h>. > The patch is attached to include <float.h> from win32_port.h. > > Thanks to Thomas Munro for pointing this out to me [1]. It is hard to > notice this issue without testing the changes on Windows. Oh, it looks like commits 33a7101281c6, 8e211f539146, 86dbbf20d849 (probably others) papered over this by the expedient of adding #include <float.h> to random .c files rather than your patch, which seems the proper fix. -- Álvaro Herrerahttps://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Re: no partition pruning when partitioning using array type
On 2018-Jul-09, Tom Lane wrote: > Suppose you did > > create domain overint as int; > create table pt (a overint) partition by range (a); > create table pt1 partition of pt for values from (0) to (100); > > and the system took it, and then you did > > alter domain overint add check (value > 100); > > What happens now? It scans the table to check whether any values violate that condition, and raises an error if they do: alvherre=# alter domain overint add check (value > 100); ERROR: column "a" of table "ppt1" contains values that violate the new constraint This looks sensible behavior to me. Now if you don't have any values that violate the new constraint, then the constraint can be created just fine, and you now have a partition that can never accept any values. But that doesn't seem like a terrible problem. -- Álvaro Herrerahttps://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Re: no partition pruning when partitioning using array type
On 2018-May-08, Amit Langote wrote: > In HEAD, since we already fixed that case in e5dcbb88a15d [1] which is a > different piece of code anyway, the patch only serves to improve the > deparse output emitted by ruleutils.c for partition constraint expressions > where pseudo-type partition key is involved. The change can be seen in > the updated test output for create_table test. Actually, even in 11/master it also fixes this case: alvherre=# explain update p set a = a || a where a = '{1}'; QUERY PLAN ── Update on p (cost=0.00..54.03 rows=14 width=38) Update on p1 Update on p2 -> Seq Scan on p1 (cost=0.00..27.02 rows=7 width=38) Filter: (a = '{1}'::integer[]) -> Seq Scan on p2 (cost=0.00..27.02 rows=7 width=38) Filter: (a = '{1}'::integer[]) (7 filas) Because UPDATE uses the predtest.c prune code, not partprune. So it's not just some ruleutils beautification. -- Álvaro Herrerahttps://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Re: no partition pruning when partitioning using array type
Alvaro Herrera writes: > On 2018-Jul-09, Tom Lane wrote: >> Suppose you did >> >> create domain overint as int; >> create table pt (a overint) partition by range (a); >> create table pt1 partition of pt for values from (0) to (100); >> >> and the system took it, and then you did >> >> alter domain overint add check (value > 100); >> >> What happens now? > It scans the table to check whether any values violate that condition, > and raises an error if they do: > alvherre=# alter domain overint add check (value > 100); > ERROR: column "a" of table "ppt1" contains values that violate the new > constraint > This looks sensible behavior to me. And what about those partition bound values? They are now illegal for the domain, so I would expect a dump/reload to fail, regardless of whether there are any values in the table. regards, tom lane
Re: New GUC to sample log queries
On 06/27/2018 11:13 PM, Adrien Nayrat wrote: >> 3) Is it intentional to only sample with log_min_duration_statement and >> not also with log_duration? It seems like it should affect both. In >> both cases, the name is too generic. Something called "log_sample_rate" >> should sample **everything**. > I do not think it could be useful to sample other case such as log_duration. > > But yes, the GUC is confusing and I am not comfortable to introduce a new GUC > in > my initial patch. > > Maybe we should adapt current GUC with something like : > > log_min_duration_statement = ,> > This give : > > log_min_duration_statement = 0,0.1 > > Equivalent to : > log_min_duration_statement = 0 > log_sample_rate = 0.1 > > Thought? > After reflection it seems a bad idea : * it breaks compatibility with external tools * it introduce a kind of "composite" GUC which may add complexity to use. For example in pg_settings view. What do you think of : log_min_duration_statement_sample ? Is it too long? I saw a few days ago this error on http://commitfest.cputube.org postgres.sgml:5202: element xref: validity error : IDREF attribute linkend references an unknown ID "log_min_duration_statement" Patch attached with fix on linkend marker Regards, -- Adrien commit d388b1e31926c2f5f12aa5453cc4df6e7d9e5f60 Author: anayrat Date: Tue Jul 10 19:58:16 2018 +0200 Add a new GUC, log_sample_rate, to log a fraction of queries. diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml index e307bb4e8e..72791ad2a6 100644 --- a/doc/src/sgml/config.sgml +++ b/doc/src/sgml/config.sgml @@ -5190,6 +5190,26 @@ local0.*/var/log/postgresql + + log_sample_rate (real) + + log_sample_rate configuration parameter + + + + + Causes logging only a fraction of the statements when is used. The default is + 1, meaning log all statements longer than + log_min_duration_statement. Setting this to zero + disables logging, same as setting + log_min_duration_statement to + -1. Using log_sample_rate is + helpful when the traffic is too high to log all queries. + + + + diff --git a/src/backend/tcop/postgres.c b/src/backend/tcop/postgres.c index f4133953be..5590f9a9d4 100644 --- a/src/backend/tcop/postgres.c +++ b/src/backend/tcop/postgres.c @@ -2115,7 +2115,8 @@ check_log_statement(List *stmt_list) /* * check_log_duration - * Determine whether current command's duration should be logged + * Determine whether current command's duration should be logged. + * If log_sample_rate < 1.0, log only a sample. * * Returns: * 0 if no logging is needed @@ -2137,6 +2138,7 @@ check_log_duration(char *msec_str, bool was_logged) int usecs; int msecs; bool exceeded; + bool in_sample; TimestampDifference(GetCurrentStatementStartTimestamp(), GetCurrentTimestamp(), @@ -2153,7 +2155,15 @@ check_log_duration(char *msec_str, bool was_logged) (secs > log_min_duration_statement / 1000 || secs * 1000 + msecs >= log_min_duration_statement))); - if (exceeded || log_duration) + /* + * Do not log if log_sample_rate = 0. + * Log a sample if log_sample_rate <= 1 and avoid unecessary random() + * call if log_sample_rate = 1. 
+ */ + in_sample = log_sample_rate != 0 && + (log_sample_rate == 1 || random() <= log_sample_rate * MAX_RANDOM_VALUE); + + if (exceeded && in_sample || log_duration) { snprintf(msec_str, 32, "%ld.%03d", secs * 1000 + msecs, usecs % 1000); diff --git a/src/backend/utils/misc/guc.c b/src/backend/utils/misc/guc.c index 17292e04fe..daf1ae3af1 100644 --- a/src/backend/utils/misc/guc.c +++ b/src/backend/utils/misc/guc.c @@ -455,6 +455,7 @@ int log_min_messages = WARNING; int client_min_messages = NOTICE; int log_min_duration_statement = -1; int log_temp_files = -1; +double log_sample_rate = 1.0; int trace_recovery_messages = LOG; int temp_file_limit = -1; @@ -3257,6 +3258,17 @@ static struct config_real ConfigureNamesReal[] = NULL, NULL, NULL }, + { + {"log_sample_rate", PGC_SUSET, LOGGING_WHEN, + gettext_noop("Fraction of statements to log."), + gettext_noop("1.0 logs all statements."), + NULL + }, + &log_sample_rate, + 1.0, 0.0, 1.0, + NULL, NULL, NULL + }, + /* End-of-list marker */ { {NULL, 0, 0, NULL, NULL}, NULL, 0.0, 0.0, 0.0, NULL, NULL, NULL diff --git a/src/backend/utils/misc/postgresql.conf.sample b/src/backend/utils/misc/postgresql.conf.sample index 657c3f81f8..b8ebbe9f7f 100644 --- a/src/backend/utils/misc/postgresql.conf.sample +++ b/src/backend/utils/misc/postgresql.conf.sample @@ -446,6 +446,8 @@ # statements running at least this number # of milliseconds +#log_sample_rate = 1 # Fraction of logged statements. 1 means log all + # statements. # - What to Log - diff --git a/src/include/utils/guc.h b/src/include/utils
Re: Desirability of client-side expressions in psql?
> > >psql> \if :i >= 5 > > I think we're ok with that so long as none of the operators or values has a \ in it. What barriers do you see to re-using the pgbench grammar?
Re: user-friendliness improvement of pageinspect
On Tue, Jul 10, 2018 at 12:41 AM, 杨杰 wrote: > Why does the heap_page_item () of the pageinspect extension not consider > providing better user-friendliness? > > My test table has the following data, and when I look at the t_data I see > data of type bytea instead of a more intuitive type, even the same type as > the original table. tuple_data_split() can do this (split the data into individual attributes). If you want a friendlier, more visual way of getting this information, then you may find pg_hexedit helpful: https://github.com/petergeoghegan/pg_hexedit -- Peter Geoghegan
Re: Allow COPY's 'text' format to output a header
On 4 July 2018 at 22:44, Simon Muller wrote: > I noticed through the patch tester link at http://commitfest.cputube.org/ > that my patch caused a file_fdw test to fail (since I previously tested > only with "make check" and not with "make check-world"). > > This v2 patch should fix that. > This patch just fixes a newline issue introduced in my previous patch. text_header_v3.patch Description: Binary data
Re: Reduce amount of WAL generated by CREATE INDEX for gist, gin and sp-gist
On 28/02/18 18:03, Anastasia Lubennikova wrote: Hi, I want to propose a bunch of patches which allow to reduce WAL traffic generated by CREATE INDEX for GiST, GIN and SP-GiST. Similarly to b-tree and RUM, we can now log index pages of other access methods only once in the end of indexbuild process. Makes sense. Is there any scenario where the current method is better? In theory, doing another pass through the whole index, to create the WAL records, requires more I/O. But in practice, it seems that the reduction in WAL is almost certainly a win. Implementation is based on generic_xlog. Why? I think we should just add a log_relation() function in xloginsert.c directly, alongside log_newpage_buffer(). This makes the assumption that all the pages in these indexes used the standard page layout. I think that's a valid assumption, but needs at least a comment on that. And perhaps an Assert, to check that pd_lower/upper look sane. As a further optimization, would it be a win to WAL-log multiple pages in each record? This leaves the XLOG_*_CREATE_INDEX WAL record types unused, BTW. - Heikki
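To make the suggestion concrete, here is a very rough sketch of what a log_relation() helper in xloginsert.c could look like — the name and shape are assumptions, not existing code — built on the existing full-page primitive and relying on the standard page layout:

#include "postgres.h"
#include "access/xloginsert.h"
#include "storage/bufmgr.h"
#include "storage/bufpage.h"
#include "utils/rel.h"

/*
 * Illustrative sketch: WAL-log every page of a finished index in one
 * pass, instead of going through generic_xlog during the build.
 */
void
log_relation(Relation rel, ForkNumber forknum, bool page_std)
{
	BlockNumber nblocks = RelationGetNumberOfBlocksInFork(rel, forknum);
	BlockNumber blkno;

	for (blkno = 0; blkno < nblocks; blkno++)
	{
		Buffer		buf;
		Page		page;
		XLogRecPtr	recptr;

		buf = ReadBufferExtended(rel, forknum, blkno, RBM_NORMAL, NULL);
		LockBuffer(buf, BUFFER_LOCK_EXCLUSIVE);
		page = BufferGetPage(buf);

		/* Emit a full-page image; assumes the standard page layout. */
		recptr = log_newpage(&rel->rd_node, forknum, blkno, page, page_std);

		/* Stamp the page LSN so redo ordering works; details glossed over. */
		PageSetLSN(page, recptr);
		MarkBufferDirty(buf);

		UnlockReleaseBuffer(buf);
	}
}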
Re: Failure assertion in GROUPS mode of window function in current HEAD
Masahiko Sawada writes: > BTW, I found an another but related issue. We can got an assertion > failure also by the following query. > =# select sum(c) over (partition by c order by c groups between 1 > preceding and current row) from test; Oh, good point, that's certainly legal per spec, so we'd need to do something about it. The proximate cause of the problem, I think, is that the planner throws away the "order by c" as being redundant because it already sorted by c for the partitioning requirement. So we end up with ordNumCols = 0 even though the query had ORDER BY. This breaks not only GROUPS cases, but also RANGE OFFSET cases, because the executor expects to have an ordering column. I thought for a bit about fixing that by forcing the planner to emit the ordering column for RANGE OFFSET even if it's redundant. But it's hard to make the existing logic in get_column_info_for_window do that --- it can't tell which partitioning column the ordering column was redundant with, and even if it could, that column might've been of a different data type. So eventually I just threw away that redundant-key-elimination logic altogether. I believe that we never thought it was really useful as an optimization, but way back when window functions were put in, we didn't have (or at least didn't think about) a way to identify the partitioning/ordering columns without reference to the input pathkeys. With this patch, WindowAggPath.winpathkeys is not used for anything anymore. I'm inclined to get rid of it, though I didn't do so here. (If we keep it, we at least need to adjust the comment in relation.h that claims createplan.c needs it.) The other issue here is that nodeWindowAgg's setup logic is not quite consistent with update_frameheadpos and update_frametailpos about when to create tuplestore read pointers and slots. We might've prevented all the inconsistent cases with the parser and planner fixes, but it seems best to make them really consistent anyway, so I changed that. Draft patch attached. regards, tom lane diff --git a/src/backend/executor/nodeWindowAgg.c b/src/backend/executor/nodeWindowAgg.c index 968d5d3..0db193b 100644 *** a/src/backend/executor/nodeWindowAgg.c --- b/src/backend/executor/nodeWindowAgg.c *** begin_partition(WindowAggState *winstate *** 1079,1084 --- 1079,1085 { WindowAgg *node = (WindowAgg *) winstate->ss.ps.plan; PlanState *outerPlan = outerPlanState(winstate); + int frameOptions = winstate->frameOptions; int numfuncs = winstate->numfuncs; int i; *** begin_partition(WindowAggState *winstate *** 1143,1150 * If the frame head is potentially movable, or we have an EXCLUSION * clause, we might need to restart aggregation ... */ ! if (!(winstate->frameOptions & FRAMEOPTION_START_UNBOUNDED_PRECEDING) || ! (winstate->frameOptions & FRAMEOPTION_EXCLUSION)) { /* ... so create a mark pointer to track the frame head */ agg_winobj->markptr = tuplestore_alloc_read_pointer(winstate->buffer, 0); --- 1144,1151 * If the frame head is potentially movable, or we have an EXCLUSION * clause, we might need to restart aggregation ... */ ! if (!(frameOptions & FRAMEOPTION_START_UNBOUNDED_PRECEDING) || ! (frameOptions & FRAMEOPTION_EXCLUSION)) { /* ... so create a mark pointer to track the frame head */ agg_winobj->markptr = tuplestore_alloc_read_pointer(winstate->buffer, 0); *** begin_partition(WindowAggState *winstate *** 1182,1202 /* * If we are in RANGE or GROUPS mode, then determining frame boundaries ! * requires physical access to the frame endpoint rows, except in * degenerate cases. 
We create read pointers to point to those rows, to * simplify access and ensure that the tuplestore doesn't discard the ! * endpoint rows prematurely. (Must match logic in update_frameheadpos ! * and update_frametailpos.) */ winstate->framehead_ptr = winstate->frametail_ptr = -1; /* if not used */ ! if ((winstate->frameOptions & (FRAMEOPTION_RANGE | FRAMEOPTION_GROUPS)) && ! node->ordNumCols != 0) { ! if (!(winstate->frameOptions & FRAMEOPTION_START_UNBOUNDED_PRECEDING)) winstate->framehead_ptr = tuplestore_alloc_read_pointer(winstate->buffer, 0); ! if (!(winstate->frameOptions & FRAMEOPTION_END_UNBOUNDED_FOLLOWING)) winstate->frametail_ptr = tuplestore_alloc_read_pointer(winstate->buffer, 0); } --- 1183,1206 /* * If we are in RANGE or GROUPS mode, then determining frame boundaries ! * requires physical access to the frame endpoint rows, except in certain * degenerate cases. We create read pointers to point to those rows, to * simplify access and ensure that the tuplestore doesn't discard the ! * endpoint rows prematurely. (Must create pointers in exactly the same ! * cases that update_frameheadpos and update_frametailpos need them.) *
Re: no partition pruning when partitioning using array type
On 2018-Jul-10, Alvaro Herrera wrote: > alvherre=# explain update p set a = a || a where a = '{1}'; > QUERY PLAN > ── > Update on p (cost=0.00..54.03 rows=14 width=38) >Update on p1 >Update on p2 >-> Seq Scan on p1 (cost=0.00..27.02 rows=7 width=38) > Filter: (a = '{1}'::integer[]) >-> Seq Scan on p2 (cost=0.00..27.02 rows=7 width=38) > Filter: (a = '{1}'::integer[]) > (7 filas) > > Because UPDATE uses the predtest.c prune code, not partprune. So it's > not just some ruleutils beautification. I added this test, modified some comments, and pushed. Thanks for the patch. -- Álvaro Herrerahttps://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Re: no partition pruning when partitioning using array type
On 2018-Jul-10, Tom Lane wrote: > And what about those partition bound values? They are now illegal > for the domain, so I would expect a dump/reload to fail, regardless > of whether there are any values in the table. Hmm, true. -- Álvaro Herrerahttps://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Re: no partition pruning when partitioning using array type
On 2018-Jul-09, Tom Lane wrote: > Alvaro Herrera writes: > > However, if we take out the > > expression_planner() and replace it with a call to > > strip_implicit_coercions(), not only it magically starts working, but > > also the regression tests start failing with the attached diff, which > > seems a Good Thing to me. > > Why would you find that to be a good thing? The prohibition against > mutable coercions seems like something we need here, for more or less > the same reason in the domain example. By the way, while playing with a partition on type money and replacing expression_planner() with strip_implicit_coercions(), the stored partition bounds are completely broken -- they end up as literals of type integer rather than money, so any insert at all into the partition fails (even if the value is nominally the same). So clearly it's not a change we want. -- Álvaro Herrerahttps://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Re: _isnan() on Windows
Alvaro Herrera writes: > On 2018-Jul-10, Emre Hasegeli wrote: >> isnan() function is evidently not present in <math.h> on Windows >> before Visual Studio 2013. We define it in win32_port.h using >> _isnan(). However _isnan() is also not present. It is in <float.h>. >> The patch is attached to include <float.h> from win32_port.h. >> >> Thanks to Thomas Munro for pointing this out to me [1]. It is hard to >> notice this issue without testing the changes on Windows. > Oh, it looks like commits 33a7101281c6, 8e211f539146, 86dbbf20d849 > (probably others) papered over this by the expedient of adding #include > <float.h> to random .c files rather than your patch, which seems the > proper fix. I disagree --- including <float.h> in c.h, as this would have us do, seems like a huge expansion of the visibility of that header. Moreover, doing that only on Windows seems certain to lead to missing-header problems in the reverse direction, ie patches developed on Windows will fail elsewhere. I think we should stick with the existing coding convention of including that only in the specific .c files that need it. regards, tom lane
Re: _isnan() on Windows
On 2018-Jul-10, Tom Lane wrote: > Alvaro Herrera writes: > > Oh, it looks like commits 33a7101281c6, 8e211f539146, 86dbbf20d849 > > (probably others) papered over this by the expedient of adding #include > > <float.h> to random .c files rather than your patch, which seems the > > proper fix. > > I disagree --- including <float.h> in c.h, as this would have us do, > seems like a huge expansion of the visibility of that header. Moreover, > doing that only on Windows seems certain to lead to missing-header > problems in the reverse direction, ie patches developed on Windows > will fail elsewhere. I don't think so, because that's only done for MSVC older than 2013. Nobody uses that for new development anymore. Downloading versions prior to 2017 is difficult enough already. > I think we should stick with the existing coding convention of including > that only in the specific .c files that need it. It seems saner to contain the ugliness to a port-specific file, where it won't pollute two dozen files for no reason. -- Álvaro Herrerahttps://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Re: GiST VACUUM
I'm now looking at the first patch in this series, to allow completely empty GiST pages to be recycled. I've got some questions: --- a/src/backend/access/gist/gist.c +++ b/src/backend/access/gist/gist.c @@ -700,6 +700,13 @@ gistdoinsert(Relation r, IndexTuple itup, Size freespace, GISTSTATE *giststate) GISTInsertStack *item; OffsetNumber downlinkoffnum; + if(GistPageIsDeleted(stack->page)) + { + UnlockReleaseBuffer(stack->buffer); + xlocked = false; + state.stack = stack = stack->parent; + continue; + } downlinkoffnum = gistchoose(state.r, stack->page, itup, giststate); iid = PageGetItemId(stack->page, downlinkoffnum); idxtuple = (IndexTuple) PageGetItem(stack->page, iid); This seems misplaced. This code deals with internal pages, and as far as I can see, this patch never marks internal pages as deleted, only leaf pages. However, we should have something like this in the leaf-page branch, to deal with the case that an insertion lands on a page that was concurrently deleted. Did you have any tests, where an insertion runs concurrently with vacuum, that would exercise this? The code in gistbulkdelete() seems pretty expensive. In the first phase, it records the parent of every empty leaf page it encounters. In the second phase, it scans every leaf page of that parent, not only those leaves that were seen as empty. I'm a bit wary of using pd_prune_xid for the checks to determine if a deleted page can be recycled yet. In heap pages, pd_prune_xid is just a hint, but here it's used for a critical check. This seems to be the same mechanism we use in B-trees, but in B-trees, we store the XID in BTPageOpaqueData.xact, not pd_prune_xid. Also, in B-trees, we use ReadNewTransactionId() to set it, not GetCurrentTransactionId(). See comments in _bt_unlink_halfdead_page() for explanation. This patch is missing any comments to explain how this works in GiST. If you crash in the middle of gistbulkdelete(), after it has removed the downlink from the parent, but before it has marked the leaf page as deleted, the leaf page is "leaked". I think that's acceptable, but a comment at least would be good. - Heikki
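As a concrete illustration of the missing leaf-page handling mentioned above — hypothetically, roughly the same dance the quoted hunk does for internal pages, moved into the leaf-page branch of gistdoinsert():

/* Hypothetical sketch for the leaf-page branch of gistdoinsert():
 * if the leaf we descended to was deleted concurrently by VACUUM,
 * back off to the parent and retry the descent from there. */
if (GistPageIsDeleted(stack->page))
{
	UnlockReleaseBuffer(stack->buffer);
	xlocked = false;
	state.stack = stack = stack->parent;
	continue;
}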
Re: patch to allow disable of WAL recycling
Thanks to everyone who took the time to look at the patch and send me feedback. I'm happy to work on improving the documentation of this new tunable to clarify when it should be used and the implications. I'm trying to understand more specifically what else needs to be done next. To summarize, I think the following general concerns were brought up.

1) Disabling WAL recycling could have a negative performance impact on a COW filesystem if all WAL files could be kept in the filesystem cache.
2) Disabling WAL recycling reduces reliability, even on COW filesystems.
3) Using something like posix_fadvise to reload recycled WAL files into the filesystem cache is better even for a COW filesystem.
4) There are "several" other purposes for WAL recycling which this tunable would impact.
5) A WAL recycling tunable is too specific and a more general solution is needed.
6) Need more performance data.

For #1, #2 and #3, I don't understand these concerns. It would be helpful if these could be more specific.

For #4, can anybody enumerate these other purposes for WAL recycling?

For #5, perhaps I am making an incorrect assumption about what the original response was requesting, but I understand that WAL recycling is just one aspect of WAL file creation/allocation. However, the creation of a new WAL file is not a problem we've ever observed. In general, any modern filesystem should do a good job of caching recently accessed files. We've never observed a problem with the allocation of a new WAL file slightly before it is needed. The problem we have observed is specifically around WAL file recycling, when we have to access old files that are long gone from the filesystem cache. The semantics around recycling seem pretty crisp as compared to some other tunable which would completely change how WAL files are created. Given that a change like that is also much more intrusive, it seems better to provide a tunable to disable WAL recycling vs. some other kind of tunable for which we can't articulate any improvement except in the recycling scenario.

For #6, there is no feasible way for us to recreate our workload on other operating systems or filesystems. Can anyone expand on what performance data is needed?

I'd like to restate the original problem we observed. When PostgreSQL decides to reuse an old WAL file whose contents have been evicted from the cache (because they haven't been used in hours), this turns what should be a workload bottlenecked by synchronous write performance (which can be well-optimized with an SSD log device) into a random read workload (which is much more expensive for any system). What's significantly worse is that we saw this on synchronous standbys. When that happened, the WAL receiver was blocked on a random read from disk, and since it's single-threaded, all write queries on the primary stop until the random read finishes. This is particularly bad for us when the sync is doing other I/O (e.g., for an autovacuum or a database backup) that causes disk reads to take hundreds of milliseconds.

To summarize, recycling old WAL files seems like an optimization designed for certain filesystems that allocate disk blocks up front. Given that the existing behavior is already filesystem-specific, are there specific reasons why we can't provide a tunable to disable this behavior for filesystems which don't behave that way?

Thanks again, Jerry

On Tue, Jun 26, 2018 at 7:35 AM, Jerry Jelinek wrote: > Hello All, > > Attached is a patch to provide an option to disable WAL recycling. 
We have > found that this can help performance by eliminating read-modify-write > behavior on old WAL files that are no longer resident in the filesystem > cache. There is a lot more detail on the background of the motivation for > this in the following thread. > > https://www.postgresql.org/message-id/flat/CACukRjO7DJvub8e2AijOayj8BfKK3 > XXBTwu3KKARiTr67M3E3w%40mail.gmail.com#CACukRjO7DJvub8e2AijOayj8BfKK3 > xxbtwu3kkaritr67m3...@mail.gmail.com > > A similar change has been tested against our 9.6 branch that we're > currently running, but the attached patch is against master. > > Thanks, > Jerry > >
Re: _isnan() on Windows
Alvaro Herrera writes: > On 2018-Jul-10, Tom Lane wrote: >> I disagree --- including <float.h> in c.h, as this would have us do, >> seems like a huge expansion of the visibility of that header. Moreover, >> doing that only on Windows seems certain to lead to missing-header >> problems in the reverse direction, ie patches developed on Windows >> will fail elsewhere. > I don't think so, because that's only done for MSVC older than 2013. > Nobody uses that for new development anymore. Hm. OK, maybe it's all right given that. I'm still a bit worried about downsides, but no doubt the buildfarm will tell us. regards, tom lane
Re: patch to allow disable of WAL recycling
On 07/10/2018 01:15 PM, Jerry Jelinek wrote: Thanks to everyone who took the time to look at the patch and send me feedback. I'm happy to work on improving the documentation of this new tunable to clarify when it should be used and the implications. I'm trying to understand more specifically what else needs to be done next. To summarize, I think the following general concerns were brought up. For #6, there is no feasible way for us to recreate our workload on other operating systems or filesystems. Can anyone expand on what performance data is needed? I think a simple way to prove this would be to run BenchmarkSQL against PostgreSQL in a default configuration with pg_xlog/pg_wal on a filesystem that is COW (zfs) and then run another test where pg_xlog/pg_wal is patched with your patch and new behavior and then run the test again. BenchmarkSQL is a more thorough benchmarking tool than something like pg_bench and is very easy to set up. The reason you would use a default configuration is because it will cause a huge amount of wal churn, although a test with a proper wal configuration would also be good. Thanks, JD -- Command Prompt, Inc. || http://the.postgres.company/ || @cmdpromptinc *** A fault and talent of mine is to tell it exactly how it is. *** PostgreSQL centered full stack support, consulting and development. Advocate: @amplifypostgres || Learn: https://postgresconf.org * Unless otherwise stated, opinions are my own. *
Re: patch to allow disable of WAL recycling
On 2018-Jul-10, Jerry Jelinek wrote: > 2) Disabling WAL recycling reduces reliability, even on COW filesystems. I think the problem here is that WAL recycling on normal filesystems helps protect the case where the filesystem gets full. If you remove it, that protection goes out the window. You can claim that people need to make sure to have available disk space, but this does become a problem in practice. I think the thing to do is verify what happens with recycling off when the disk gets full; is it possible to recover afterwards? Is there any corrupt data? What happens if the disk gets full just as the new WAL file is being created -- is there a Postgres PANIC or something? As I understand, with recycling on it is easy (?) to recover, there is no PANIC crash, and no data corruption results. -- Álvaro Herrera https://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Problem with tupdesc in jsonb_to_recordset
Hi, I've found out that currently in some situations jsonb_to_recordset can lead to a crash. Minimal example that I've managed to create looks like this:

CREATE OR REPLACE FUNCTION test(data JSONB) RETURNS INTEGER AS $$
DECLARE
    test_var int;
BEGIN
    WITH jsonb_values AS (
        SELECT
            (SELECT SUM(value)
             FROM jsonb_to_recordset(element #> '{values}') AS q(value INTEGER)) AS value_sum
        FROM jsonb_array_elements(data) AS element
    )
    SELECT SUM(value_sum) FROM jsonb_values INTO test_var;
    RETURN test_var;
END;
$$ LANGUAGE plpgsql;

And then:

=# SELECT test('[
    { "values": [ { "value": 1 }, { "value": 3 } ] },
    { "values": [ { "value": 1 }, { "value": 3 } ] }
]' :: JSONB);
server closed the connection unexpectedly
This probably means the server terminated abnormally before or while processing the request.
The connection to the server was lost. Attempting reset: Failed.

After brief investigation it looks like an issue with tupdesc from the function cache:

    if (!cache)
    {
        fcinfo->flinfo->fn_extra = cache =
            MemoryContextAllocZero(fcinfo->flinfo->fn_mcxt, sizeof(*cache));
        //...
    rsi->setDesc = cache->c.io.composite.tupdesc;

Then after the first call of populate_recordset_worker rsi->setDesc is being reset since we never increased tdrefcount and the next call will use wrong cache data. Apparently, it can be fixed by incrementing a tdrefcount (see the attached patch).

diff --git a/src/backend/utils/adt/jsonfuncs.c b/src/backend/utils/adt/jsonfuncs.c
index e358b5a..9250646 100644
--- a/src/backend/utils/adt/jsonfuncs.c
+++ b/src/backend/utils/adt/jsonfuncs.c
@@ -3728,6 +3728,7 @@ populate_recordset_worker(FunctionCallInfo fcinfo, const char *funcname,
 
 	rsi->setResult = state->tuple_store;
 	rsi->setDesc = cache->c.io.composite.tupdesc;
+	rsi->setDesc->tdrefcount = 1;
 
 	PG_RETURN_NULL();
 }
Re: cost_sort() improvements
On 06/28/2018 06:47 PM, Teodor Sigaev wrote: > Hi! > > ... > > - cost for i-th columns: > Ci * log2(NGi) => Ci * log2(N/G(i-1)) > So, here I believe, that i-th comparison function will be called only > inside > group which number is defined by (i-1) leading columns. Note, following > discussion [1] I add multiplier 1.5 here to be closer to worst case when > groups are significantly differ. Right now there is no way to > determine how > differ they could be. Note, this formula describes cost of first > column too > except difference with multiplier.

One more thought about estimating the worst case - I wonder if simply multiplying the per-tuple cost by 1.5 is the right approach. It does not seem particularly principled, and it's trivially simple to construct counter-examples defeating it (imagine columns with 99% of the rows having the same value, and then many values in exactly one row - that will defeat the ndistinct-based group size estimates).

The reason why constructing such counter-examples is so simple is that this relies entirely on the ndistinct stats, ignoring the other stats we already have. I think this might leverage the per-column MCV lists (and eventually multi-column MCV lists, if that ever gets committed).

We're estimating the number of tuples in a group (after fixing values in the preceding columns), because that's what determines the number of comparisons we need to make at a given stage. The patch does this by estimating the average group size, by estimating ndistinct of the preceding columns G(i-1), and computing ntuples/G(i-1).

But consider that we also have MCV lists for each column - if there is a very common value, it should be there. So let's say M(i) is the frequency of the most common value in the i-th column, determined either from the MCV list or as 1/ndistinct for that column. Then if m(i) is the minimum of M(i) for the first i columns, the largest possible group of values when comparing values in the i-th column is

    N * m(i-1)

This seems like a better way to estimate the worst case. I don't know if this should be used instead of NG(i), or if those two estimates should be combined in some way.

I think estimating the largest group we need to sort should be helpful for the incremental sort patch, so I'm adding Alexander to this thread.

regards -- Tomas Vondra http://www.2ndQuadrant.com PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
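To make that concrete, a small worked example with made-up numbers (purely illustrative, not measured data):

    m(i) = minimum of M(j) over j <= i
    worst-case group size at column i ~= N * m(i-1)

    Example: N = 1,000,000 rows. Column 1 has one value covering 99% of
    the rows plus many values that appear in exactly one row, so
    M(1) = 0.99 while ndistinct(1) is on the order of 10,000. The
    ndistinct-based average group size for column 2 is
    N / ndistinct(1) ~= 100 tuples, but the largest group that actually
    has to be compared on column 2 is N * m(1) = 990,000 tuples, roughly
    four orders of magnitude larger. This is exactly the 99%-duplicates
    counter-example above, and no fixed multiplier such as 1.5 can cover
    that gap.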
Re: Optimze usage of immutable functions as relation
On 16/05/18 13:47, Aleksandr Parfenov wrote: Hello, I reworked a patch to make more stable in different cases. I decided to use simplify_function instead of eval_const_expression to prevent inlining of the function. The only possible outputs of the simplify_function are Const node and NULL. Hmm. That's still a bit inefficient, we'll try to simplify the function on every reference to it. We already simplify functions in function RTEs, but we currently do it after preprocessing all other expressions in the query. See subquery_planner(), around comment "/* Also need to preprocess expressions within RTEs */". If we moved that so that we simplify expressions in the range table first, then in the code that you're adding to eval_const_expression_mutator(), you could just check if the function expression is already a Const. But stepping back a bit, it's a bit weird that we're handling this differently from VALUES and other subqueries. The planner knows how to do this trick for simple subqueries: postgres=# explain select * from tenk1, (select abs(100)) as a (a) where unique1 < a; QUERY PLAN --- Seq Scan on tenk1 (cost=0.00..483.00 rows=100 width=248) Filter: (unique1 < 100) (2 rows) Note that it not only evaluated the function into a constant, but also got rid of the join. For a function RTE, however, it can't do that: postgres=# explain select * from tenk1, abs(100) as a (a) where unique1 < a; QUERY PLAN --- Nested Loop (cost=0.00..583.01 rows= width=248) Join Filter: (tenk1.unique1 < a.a) -> Function Scan on a (cost=0.00..0.01 rows=1 width=4) -> Seq Scan on tenk1 (cost=0.00..458.00 rows=1 width=244) (4 rows) Could we handle this in pull_up_subqueries(), similar to the RTE_SUBQUERY and RTE_VALUES cases? - Heikki
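To illustrate the suggestion, a hypothetical sketch (not the submitted patch, and glossing over details such as rowtype-returning functions and varattno mapping): once function RTEs are simplified before the other expressions, the mutator could substitute a Var referencing such an RTE along these lines:

    /*
     * Hypothetical sketch: if the Var points at a function RTE whose single
     * function expression was already reduced to a Const by the earlier RTE
     * preprocessing pass, just substitute that constant.
     */
    if (rte->rtekind == RTE_FUNCTION && list_length(rte->functions) == 1)
    {
        RangeTblFunction *rtf = (RangeTblFunction *) linitial(rte->functions);

        if (IsA(rtf->funcexpr, Const))
            return (Node *) copyObject(rtf->funcexpr);
    }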
Re: Jsonb transform for pl/python
Peter Eisentraut writes: > On 6/23/18 01:44, Nikita Glukhov wrote: >> We are simply trying first to convert numeric to int64 if it does not have >> digits after the decimal point, and then construct Python Int instead of >> Decimal. Standard Python json.loads() does the same for exact integers. > We just had a very similar but not identical discussion in the thread > about the PL/Perl transforms, where we rejected the proposal. Other > comments on this one? I can sympathize with the speed concern, but the proposed patch produces a functional change (ie, you get a different kind of Python object, with different behaviors, in some cases). That seems to me to be a good reason to reject it. The fact that it makes the behavior vary depending on the local width of "long" is also a problem, although I think that could be fixed. More generally, I'd be happier with a patch that sped things up for non-integers too. I don't know whether Python exposes enough internals of type Decimal to make that practical, though. One idea for doing it at arm's length is to compute the Decimal value using numeric-digit-at-a-time arithmetic, roughly

    x := 0;
    foreach(digit, numeric)
        x.fma(1, digit);
    x.scaleb(appropriate-exponent);

In principle that's probably faster than string conversion, but not sure by how much. regards, tom lane
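For reference, the arithmetic such a loop has to reproduce, assuming numeric's base-10000 digit array d_0 .. d_{n-1} with weight w (sign and display scale ignored here):

    value = sum over i of d_i * 10000^(w - i)
          = (((d_0 * 10000 + d_1) * 10000 + ...) * 10000 + d_{n-1}) * 10000^(w - n + 1)

so the Decimal could be accumulated one NBASE digit at a time in Horner fashion and scaled exactly once at the end, which is what the fma/scaleb sketch is aiming at. Whether that actually beats string conversion would still need to be measured.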
Re: Optimze usage of immutable functions as relation
Heikki Linnakangas writes: > But stepping back a bit, it's a bit weird that we're handling this > differently from VALUES and other subqueries. The planner knows how to > do this trick for simple subqueries: > postgres=# explain select * from tenk1, (select abs(100)) as a (a) where > unique1 < a; > QUERY PLAN > --- > Seq Scan on tenk1 (cost=0.00..483.00 rows=100 width=248) > Filter: (unique1 < 100) > (2 rows) > Note that it not only evaluated the function into a constant, but also > got rid of the join. For a function RTE, however, it can't do that: > postgres=# explain select * from tenk1, abs(100) as a (a) where unique1 < a; > QUERY PLAN > --- > Nested Loop (cost=0.00..583.01 rows= width=248) > Join Filter: (tenk1.unique1 < a.a) > -> Function Scan on a (cost=0.00..0.01 rows=1 width=4) > -> Seq Scan on tenk1 (cost=0.00..458.00 rows=1 width=244) > (4 rows) > Could we handle this in pull_up_subqueries(), similar to the > RTE_SUBQUERY and RTE_VALUES cases? Perhaps. You could only do it for non-set-returning functions, which isn't the typical use of function RTEs, which is probably why we've not thought hard about it before. I'm not sure what would need to happen for lateral functions. Also to be considered, if it's not foldable to a constant, is whether we're risking evaluating it more times than before. regards, tom lane
Re: Desirability of client-side expressions in psql?
Hello Corey, psql> \if :i >= 5 I think we're ok with that so long as none of the operators or values has a \ in it. What barriers do you see to re-using the pgbench grammar? The pgbench expression grammar mimics SQL expression grammar, on integers, floats, booleans & NULL. I'm unsure about some special cases in psql (`shell command`, 'text' "identifier"). They can be forbidden on a new command (\let), but what happens on "\if ..." which I am afraid allows them is unclear. -- Fabien.
CVE-2017-7484-induced bugs, or, btree cmp functions are not leakproof?
I poked into the odd behavior reported in bug #15251: https://www.postgresql.org/message-id/152967245839.1266.6939666809369185...@wrigleys.postgresql.org The reason for the poor plan chosen when the caller hasn't got select privilege on the child table is that statistic_proc_security_check() decides not to allow use of the stats for the child table. There are two functions for which it decides that, because they aren't leakproof: textregexeq() and btint4cmp(). Now, it's probably necessary that we consider textregexeq() leaky, since it might spit up errors about its pattern argument. But btint4cmp? It seems pretty silly that the integer comparison operators are marked leakproof while their underlying comparison function isn't. (The reason we're hitting this is that calc_arraycontsel() finds the datatype's default btree comparison function and tries to use that to estimate selectivity. While marking btint4cmp leakproof doesn't completely fix the misestimation, it goes a long way in this example.) I propose to run through the system operator classes, find any for which the comparison function isn't marked leakproof but the operators are, and fix them. This is clearly appropriate for HEAD and maybe it's not too late to force an initdb for v11 --- thoughts? Another question that could be raised is why we are refusing to use stats for a child table when the caller has select on the parent. It's completely trivial to extract data from a child table if you have select on the parent, so it seems like we are checking the wrong table's privileges. regards, tom lane
possible issue with array type created for every heap relation composite type
Hello, In Postgres 8.3, it was decided that an array type would be automatically created for every heap relation's composite type. Reference thread: https://www.postgresql.org/message-id/flat/20070302234016.GF3665%40fetter.org The possible issue I would like to note is related to how the array type is named in makeArrayTypeName() function. The composite type will take the heap relation's relname as its typname and the array type will usually be named with an underscore prepended (after first attempting to use the relname and hitting typname collision with the composite type). If the typname with the underscore prepended is already taken, the logic is to continue prepending underscores until there are no typname collisions (truncating the end of the typname if typname gets past NAMEDATALEN of 64). It is possible that enough heap relations with similar relnames could cause more and more typname collisions until you end up with typnames that primarily consist of underscores or not being able to construct a typname because we have reached a typname consisting of all underscores (which can cause table creation failure). This is more likely to happen when heap relnames are similar and of string length 63+. I can see this possibly being an issue with table partitioning if a user decides on partition names that reach string length 63+ (assuming user uses exactly 63 characters or does not hit relation already exists from relname truncation). This may also become an issue if we ever decide to do automated partition naming instead of having the user naming the partitions manually. Is this an issue we should be worrying about? When and how often are array types of heap relation's composite type used (I can only think of doing array_agg on table rows)? Should we consider coding up "CREATE TYPE foo ARRAY OF bar" as suggested in the past in the above referenced hackers thread? Attached are some SQL examples describing this issue. Thanks, Jimmy typname_collision.sql Description: Binary data typname_collision_with_partitioning.sql Description: Binary data
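As a rough paraphrase of the naming loop described above (purely illustrative pseudo-C, not the actual makeArrayTypeName() source; typeName and typeNamespace are assumed inputs):

    /*
     * Illustrative paraphrase of the behavior described above: keep
     * prepending an underscore and truncating to NAMEDATALEN-1 bytes until
     * the name no longer collides with an existing type in the namespace.
     */
    char        arr[NAMEDATALEN];
    char        tmp[NAMEDATALEN];

    strlcpy(arr, typeName, sizeof(arr));
    for (;;)
    {
        snprintf(tmp, sizeof(tmp), "_%s", arr);     /* prepend, truncating */
        strlcpy(arr, tmp, sizeof(arr));

        if (!SearchSysCacheExists2(TYPENAMENSP,
                                   CStringGetDatum(arr),
                                   ObjectIdGetDatum(typeNamespace)))
            break;                                  /* name is free */

        if (strspn(arr, "_") == strlen(arr))
            elog(ERROR, "could not form array type name for type \"%s\"",
                 typeName);                         /* all underscores: give up */
    }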
Re: CVE-2017-7484-induced bugs, or, btree cmp functions are not leakproof?
On 2018-Jul-10, Tom Lane wrote: > I propose to run through the system operator classes, find any for which > the comparison function isn't marked leakproof but the operators are, > and fix them. This is clearly appropriate for HEAD and maybe it's not > too late to force an initdb for v11 --- thoughts? on initdb in v11, see https://postgr.es/m/cakjs1f9cqosks9jvcbkga2mdn-24ypwc9xlzfdnsmjjmupt...@mail.gmail.com -- Álvaro Herrera https://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Re: possible issue with array type created for every heap relation composite type
Jimmy Yih writes: > The possible issue I would like to note is related to how the array type is > named in makeArrayTypeName() function. The composite type will take the > heap relation's relname as its typname and the array type will usually be > named with an underscore prepended (after first attempting to use the > relname and hitting typname collision with the composite type). If the > typname with the underscore prepended is already taken, the logic is to > continue prepending underscores until there are no typname collisions > (truncating the end of the typname if typname gets past NAMEDATALEN of > 64). It is possible that enough heap relations with similar relnames could > cause more and more typname collisions until you end up with typnames that > primarily consist of underscores or not being able to construct a typname > because we have reached a typname consisting of all underscores (which can > cause table creation failure). We've never heard a field report of this happening, so I'm not terribly concerned about it. regards, tom lane
Re: CVE-2017-7484-induced bugs, or, btree cmp functions are not leakproof?
Alvaro Herrera writes: > on initdb in v11, see > https://postgr.es/m/cakjs1f9cqosks9jvcbkga2mdn-24ypwc9xlzfdnsmjjmupt...@mail.gmail.com [confused...] I recall a thread mentioning that we might need initdb in v11, but that one doesn't seem to be it? regards, tom lane
Re: Non-reserved replication slots and slot advancing
On Tue, Jul 10, 2018 at 09:51:27AM -0400, Alvaro Herrera wrote: > On 2018-Jul-10, Michael Paquier wrote: > I say please push already. We can push more fixes later if they are > needed. OK, I have pushed what I have now, with all the suggestions from upthread included. -- Michael
TRUNCATE tables referenced by FKs on partitioned tables
$subject is broken:

create table prim (a int primary key);
create table partfk (a int references prim) partition by range (a);
create table partfk1 partition of partfk for values from (0) to (100);
create table partfk2 partition of partfk for values from (100) to (200);

You can't truncate prim on its own. This is expected.

alvherre=# truncate table prim;
ERROR: cannot truncate a table referenced in a foreign key constraint
DETALLE: Table "partfk" references "prim".
SUGERENCIA: Truncate table "partfk" at the same time, or use TRUNCATE ... CASCADE.

However, you can't do it even if you try to include partfk in the mix:

alvherre=# truncate table prim, partfk;
ERROR: cannot truncate a table referenced in a foreign key constraint
DETALLE: Table "partfk" references "prim".
SUGERENCIA: Truncate table "partfk" at the same time, or use TRUNCATE ... CASCADE.

Trying to list all the partitions individually is pointless:

alvherre=# truncate table prim, partfk, partfk1, partfk2;
ERROR: cannot truncate a table referenced in a foreign key constraint
DETALLE: Table "partfk" references "prim".
SUGERENCIA: Truncate table "partfk" at the same time, or use TRUNCATE ... CASCADE.

CASCADE is also useless:

alvherre=# truncate table prim cascade;
NOTICE: truncate cascades to table "partfk"
NOTICE: truncate cascades to table "partfk1"
NOTICE: truncate cascades to table "partfk2"
ERROR: cannot truncate a table referenced in a foreign key constraint
DETALLE: Table "partfk" references "prim".
SUGERENCIA: Truncate table "partfk" at the same time, or use TRUNCATE ... CASCADE.

-- Álvaro Herrera https://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Re: shared-memory based stats collector
On 2018-07-10 14:52:13 +0200, Tomas Vondra wrote: > There's one more reason why attempts to keep stats snapshots "perfectly" > consistent are likely doomed to fail - the messages are sent over UDP, which > does not guarantee delivery etc. So there's always some level of possible > inconsistency even with "perfectly consistent" snapshots. FWIW, I don't see us continuing to do so if we go for a shared hashtable for stats. - Andres
Shared buffer access rule violations?
Hi, In order to make changes to a shared buffer, one must hold a pin on it and the content lock in exclusive mode. This rule seems to be followed in most places, but there are a few exceptions. One can find several PageInit() calls with no content lock held. See, for example:

    fill_seq_with_data()
    vm_readbuf()
    fsm_readbuf()

Moreover, fsm_vacuum_page() performs "((FSMPage) PageGetContents(page))->fp_next_slot = 0;" without the content lock. There may be more, but I want to know if these can be treated as violations before moving ahead. Asim
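For reference, a minimal sketch of the rule as stated above (pin the buffer, take the content lock exclusively, then modify); the relation and block number are placeholders, and WAL-logging/critical-section details are omitted:

    /* Pin the buffer; ReadBuffer returns it pinned. */
    Buffer      buf = ReadBuffer(rel, blkno);
    Page        page;

    /* Take the content lock in exclusive mode before touching the page. */
    LockBuffer(buf, BUFFER_LOCK_EXCLUSIVE);

    page = BufferGetPage(buf);
    PageInit(page, BufferGetPageSize(buf), 0);  /* modification is now legal */
    MarkBufferDirty(buf);

    /* Real code would WAL-log the change inside a critical section here. */

    UnlockReleaseBuffer(buf);                   /* drop content lock and pin */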
Re: no partition pruning when partitioning using array type
On 2018/07/11 3:18, Alvaro Herrera wrote: > On 2018-May-08, Amit Langote wrote: > >> In HEAD, since we already fixed that case in e5dcbb88a15d [1] which is a >> different piece of code anyway, the patch only serves to improve the >> deparse output emitted by ruleutils.c for partition constraint expressions >> where pseudo-type partition key is involved. The change can be seen in >> the updated test output for create_table test. > > Actually, even in 11/master it also fixes this case: > > alvherre=# explain update p set a = a || a where a = '{1}'; > QUERY PLAN > ── > Update on p (cost=0.00..54.03 rows=14 width=38) >Update on p1 >Update on p2 >-> Seq Scan on p1 (cost=0.00..27.02 rows=7 width=38) > Filter: (a = '{1}'::integer[]) >-> Seq Scan on p2 (cost=0.00..27.02 rows=7 width=38) > Filter: (a = '{1}'::integer[]) > (7 filas) > > Because UPDATE uses the predtest.c prune code, not partprune. So it's > not just some ruleutils beautification. That's true. Shame I totally missed that. Thanks, Amit
Re: Concurrency bug in UPDATE of partition-key
On 2018-Jul-09, Amit Kapila wrote: > Alvaro, > > Can you please comment whether this addresses your concern? I was thinking that it would be a matter of passing the tuple slot to EvalPlanQual for it to fill, rather than requiring it to fill some other slot obtained from who-knows-where, but I realize now that that's nigh impossible. Thanks for the explanation and patience. What bothers me about this whole business is how ExecBRDeleteTriggers and ExecDelete are now completely different from their sibling routines, but maybe there's no helping that. Please move the output arguments to the end of the argument lists; also, it would be great if you added commentary about ExecDelete's other undocumented arguments (tupleDeleted in particular) while you're in the vicinity. Thanks -- Álvaro Herrera https://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
RE: How can we submit code patches that implement our (pending) patents?
From: Dave Page [mailto:dp...@pgadmin.org] > SFLC have acted as the project's counsel in the past, so I'm not surprised > they aren't talking to you; you won't be a known contact to them as a PG > contributor, and as a Fujitsu employee there would likely be a conflict > of interest for them to talk to you. I see. Then I hope for some reply saying so, so that I can give up my hope that I might get a good idea from them... Who is a known contact to SFLC in the PostgreSQL community? Can we expect a response from SFLC if he/she contacts them? Regards Takayuki Tsunakawa
Re: _isnan() on Windows
On Tue, Jul 10, 2018 at 04:23:42PM -0400, Tom Lane wrote: > Alvaro Herrera writes: >> On 2018-Jul-10, Tom Lane wrote: >>> I disagree --- including <float.h> in c.h, as this would have us do, >>> seems like a huge expansion of the visibility of that header. Moreover, >>> doing that only on Windows seems certain to lead to missing-header >>> problems in the reverse direction, ie patches developed on Windows >>> will fail elsewhere. > >> I don't think so, because that's only done for MSVC older than 2013. >> Nobody uses that for new development anymore. > > Hm. OK, maybe it's all right given that. I'm still a bit worried > about downsides, but no doubt the buildfarm will tell us. It seems to me that this patch is a good idea. Any objections if I take care of it? I have a Windows VM with only MSVC 2015 if I recall correctly though... -- Michael
Re: no partition pruning when partitioning using array type
On 2018/07/11 4:48, Alvaro Herrera wrote: > On 2018-Jul-10, Alvaro Herrera wrote: > >> alvherre=# explain update p set a = a || a where a = '{1}'; >> QUERY PLAN >> ── >> Update on p (cost=0.00..54.03 rows=14 width=38) >>Update on p1 >>Update on p2 >>-> Seq Scan on p1 (cost=0.00..27.02 rows=7 width=38) >> Filter: (a = '{1}'::integer[]) >>-> Seq Scan on p2 (cost=0.00..27.02 rows=7 width=38) >> Filter: (a = '{1}'::integer[]) >> (7 filas) >> >> Because UPDATE uses the predtest.c prune code, not partprune. So it's >> not just some ruleutils beautification. > > I added this test, modified some comments, and pushed. > > Thanks for the patch. Thank you for committing. Regards, Amit
RE: How can we submit code patches that implement our (pending) patents?
From: Tom Lane [mailto:t...@sss.pgh.pa.us] > The core team has considered this matter, and has concluded that it's > time to establish a firm project policy that we will not accept any code > that is known to be patent-encumbered. The long-term legal risks and > complications involved in doing that seem insurmountable, given the > community's amorphous legal nature and the existing Postgres license > wording (neither of which are open for negotiation here). Furthermore, > Postgres has always been very friendly to creation of closed-source > derivatives, but it's hard to see how inclusion of patented code would > not cause serious problems for those. The potential benefits of > accepting patented code just don't seem to justify trying to navigate > these hazards. That decision is very unfortunate for me as a corporate employee on one hand, but on the other, as one developer, the firm cleanness looks a bit bright to me. As a practical matter, when and where are you planning to post the project policy? How would you check for and prevent patented code? Regards Takayuki Tsunakawa
Re: Excessive PostmasterIsAlive calls slow down WAL redo
On Tue, Jul 10, 2018 at 11:39 PM, Heikki Linnakangas wrote: > The 'postmaster_possibly_dead' flag is not reset anywhere. So if a process > receives a spurious death signal, even though postmaster is still alive, > PostmasterIsAlive() will continue to use the slow path. +1 > postmaster_possibly_dead needs to be marked as 'volatile', no? +1 > The autoconf check for PR_SET_PDEATHSIG seems slightly misplaced. And I > think we can simplify it with AC_CHECK_HEADER(). I'd also like to avoid > adding code to c.h for this, that seems too global. +1, much nicer, thanks. > After some kibitzing, I ended up with the attached. It fixes the > postmaster_possible_dead issues mentioned above, and moves around the > autoconf and #ifdef logic a bit to make it a bit nicer, at least in my > opinion. Thanks, that looks good to me. I added your name as co-author and pushed to master. I also made a couple of minor cosmetic changes in PostmasterDeathSignalInit() to make the follow-up patch prettier (#if defined() instead of #ifdef, and a signum variable because I later need its address). > I don't have a FreeBSD machine at hand, so I didn't try fixing that > patch. I updated the FreeBSD version to use the header test approach you showed, and pushed that too. FWIW the build farm has some FreeBSD animals with and without PROC_PDEATHSIG_CTL. I suppose it's possible that we might want to reconsider the choice of signal in the future (SIGINFO or SIGPWR). (Random archeological note: TIL that Linux stole this from Irix (RIP), but it had PR_TERMCHILD instead of PR_SET_PDEATHSIG.) -- Thomas Munro http://www.enterprisedb.com
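As a standalone illustration of the mechanism under discussion (Linux-specific; the signal number and variable names here are arbitrary choices for the example), the child asks the kernel to deliver a signal when its parent exits, and the handler only sets a flag so the expensive liveness check can be skipped until the flag is raised:

    #include <signal.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <unistd.h>
    #include <sys/types.h>
    #include <sys/prctl.h>

    /* analogous to the postmaster_possibly_dead flag discussed above */
    static volatile sig_atomic_t parent_possibly_dead = 0;

    static void
    death_handler(int signo)
    {
        parent_possibly_dead = 1;   /* cheap; a definitive check can follow */
    }

    int
    main(void)
    {
        pid_t       pid = fork();

        if (pid == 0)
        {
            /* child: ask for SIGUSR1 when the parent process exits */
            signal(SIGUSR1, death_handler);
            if (prctl(PR_SET_PDEATHSIG, SIGUSR1) < 0)
            {
                perror("prctl");
                exit(1);
            }

            /*
             * A real implementation would also re-check getppid() here, to
             * close the race where the parent exited before prctl() ran.
             */
            while (!parent_possibly_dead)
                sleep(1);           /* sleep is cut short when the signal arrives */

            printf("parent appears to be gone\n");
            return 0;
        }

        sleep(1);                   /* parent exits shortly; child gets SIGUSR1 */
        return 0;
    }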
Re: Speeding up INSERTs and UPDATEs to partitioned tables
On 6 July 2018 at 21:25, Kato, Sho wrote:
> 2. 11beta2 + patch1 + patch2
>
> patch1: Allow direct lookups of AppendRelInfo by child relid
> commit 7d872c91a3f9d49b56117557cdbb0c3d4c620687
> patch2: 0001-Speed-up-INSERT-and-UPDATE-on-partitioned-tables.patch
>
>  part_num |   tps_ex    | latency_avg | update_latency | select_latency | insert_latency
> ----------+-------------+-------------+----------------+----------------+----------------
>       100 | 1224.430344 |       0.817 |          0.551 |          0.085 |          0.048
>       200 |  689.567511 |        1.45 |           1.12 |          0.119 |           0.05
>       400 |  347.876616 |       2.875 |          2.419 |          0.185 |          0.052
>       800 |  140.489269 |       7.118 |          6.393 |          0.329 |          0.059
>      1600 |   29.681672 |      33.691 |         31.272 |          1.517 |          0.147
>      3200 |    7.021957 |     142.412 |          136.4 |          4.033 |          0.214
>      6400 |    1.462949 |     683.557 |        669.187 |          7.677 |          0.264

Hi, Thanks a lot for benchmarking this. Just a note to say that the "Allow direct lookups of AppendRelInfo by child relid" patch is already in master. It's much more relevant to be testing with master than pg11. This patch is not intended for pg11. -- David Rowley http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
Re: user-friendliness improvement of pageinspect
On Tue, Jul 10, 2018 at 6:44 PM, Yang Jie wrote: > my question is not split the data into individual attributes. > I want to see the data in the table, but I don't want to be a bytea type. What's wrong with simply using an SQL query? -- Peter Geoghegan
Re: user-friendliness improvement of pageinspect
Thank you for your answer. my question is not split the data into individual attributes. I want to see the data in the table, but I don't want to be a bytea type. On July 11, 2018, 02:48, Peter Geoghegan wrote: On Tue, Jul 10, 2018 at 12:41 AM, Yang Jie wrote: Why does the heap_page_item () of the pageinspect extension not consider providing better user-friendliness? My test table has the following data, and when I look at the t_data I see data of type bytea instead of a more intuitive type, even the same type as the original table. tuple_data_split() can do this (split the data into individual attributes). If you want a friendlier, more visual way of getting this information, then you may find pg_hexedit helpful: https://github.com/petergeoghegan/pg_hexedit -- Peter Geoghegan
Re: no partition pruning when partitioning using array type
On 2018/07/11 4:50, Alvaro Herrera wrote: > On 2018-Jul-10, Tom Lane wrote: > >> And what about those partition bound values? They are now illegal >> for the domain, so I would expect a dump/reload to fail, regardless >> of whether there are any values in the table. > > Hmm, true. There is a patch to overhaul how partition bound values are parsed: https://commitfest.postgresql.org/18/1620/ With that patch, the error you reported upthread goes away (that is, one can successfully create partitions), but the problem that Tom mentioned above then appears. What's the solution here then? Prevent domains as partition key? Thanks, Amit
Re: [HACKERS] SERIALIZABLE with parallel query
On Mon, Jul 2, 2018 at 3:12 PM, Masahiko Sawada wrote: > On Fri, Jun 29, 2018 at 7:28 PM, Thomas Munro > wrote: >> On Thu, Jun 28, 2018 at 7:55 PM, Masahiko Sawada >> wrote: >>> I'd like to test and review this patches but they seem to conflict >>> with current HEAD. Could you please rebase them? >> >> Hi Sawada-san, >> >> Thanks! Rebased and attached. The only changes are: the LWLock >> tranche is now shown as "serializable_xact" instead of "sxact" (hmm, >> LWLock tranches have lower_case_names_with_underscores, but individual >> LWLocks have CamelCaseName...), and ShareSerializableXact() no longer >> does Assert(!IsParallelWorker()). These changes are based on the last >> feedback from Robert. >> > > Thank you! Will look at patches. >

I looked at these patches. The latest patch can build without any errors and warnings and passes all regression tests. I don't see critical bugs, but there are some random comments.

+	/*
+	 * If the leader in a parallel query earler stashed a partially
+	 * released SERIALIZABLEXACT for final clean-up at end of transaction
+	 * (because workers might still have been accessing it), then it's
+	 * time to restore it.
+	 */

There is a typo. s/earler/earlier/

Should we add a test to check that the write-skew[1] anomaly doesn't happen even in parallel mode?

- * We aren't acquiring lightweight locks for the predicate lock or lock
+ * We acquire an LWLock in the case of parallel mode, because worker
+ * backends have access to the leader's SERIALIZABLEXACT. Otherwise,
+ * we aren't acquiring lightweight locks for the predicate lock or lock

Both "LWLock" and "lightweight lock" are used. Maybe it's better to unify the spelling.

@@ -2231,9 +2234,12 @@ PrepareTransaction(void)
 	/*
 	 * Mark serializable transaction as complete for predicate locking
 	 * purposes. This should be done as late as we can put it and still allow
-	 * errors to be raised for failure patterns found at commit.
+	 * errors to be raised for failure patterns found at commit. This is not
+	 * appropriate for parallel workers however, because we aren't committing
+	 * the leader's transaction and its serializable state will live on.
 	 */
-	PreCommit_CheckForSerializationFailure();
+	if (!IsParallelWorker())
+		PreCommit_CheckForSerializationFailure();

This code assumes that parallel workers could prepare a transaction. Is that expected behavior of parallel query? There is an assertion !IsInParallelMode() at the beginning of that function though.

+    /*
+     * If the transaction is committing, but it has been partially released
+     * already, then treat this as a roll back. It was marked as rolled back.
+     */
+    if (isCommit && SxactIsPartiallyReleased(MySerializableXact))
+        isCommit = false;
+

Isn't it better to add an assertion to check that MySerializableXact->flags has the SXACT_FLAG_ROLLED_BACK flag, for safety?

[1] https://en.wikipedia.org/wiki/Snapshot_isolation#Definition

Regards, -- Masahiko Sawada NIPPON TELEGRAPH AND TELEPHONE CORPORATION NTT Open Source Software Center
Re: user-friendliness improvement of pageinspect
On Tue, Jul 10, 2018 at 6:55 PM, Yang Jie wrote: > Simple queries are fine, but pageinspect can query previous data. If you > return the same result as a simple query, this is similar to the oracle > flashback version query. Please don't top-post. That may be true in some limited sense, but querying old data like that seems totally impractical. You might do something like that for forensic purposes, as a once off thing, but you didn't say anything about your high level goals here. -- Peter Geoghegan
Re: user-friendliness improvement of pageinspect
Simple queries are fine, but pageinspect can query previous data. If you return the same result as a simple query, this is similar to the oracle flashback version query. On 7/11/2018 09:51,Peter Geoghegan wrote: On Tue, Jul 10, 2018 at 6:44 PM, Yang Jie wrote: my question is not split the data into individual attributes. I want to see the data in the table, but I don't want to be a bytea type. What's wrong with simply using an SQL query? -- Peter Geoghegan