Re: Multivariate MCV list vs. statistics target
On Tue, 18 Jun 2019 at 22:34, Tomas Vondra wrote:
>
> One slightly inconvenient thing I realized while playing with the
> address data set is that it's somewhat difficult to set the desired size
> of the multi-column MCV list.
>
> At the moment, we simply use the maximum statistics target for attributes
> the MCV list is built on. But that does not allow keeping the default size
> for per-column stats and only increasing the size of multi-column MCV lists.
>
> So I'm thinking we should allow tweaking the statistics for extended
> stats, and serialize it in the pg_statistic_ext catalog. Any opinions
> why that would be a bad idea?
>

Seems reasonable to me. This might not be the only option we'll ever want
to add though, so perhaps a "stxoptions text[]" column along the lines of
a relation's reloptions would be the way to go.

> I suppose it should be part of the CREATE STATISTICS command, but I'm
> not sure what'd be the best syntax. We might also have something more
> similar to ALTER COLUMN, but perhaps
>
> ALTER STATISTICS s SET STATISTICS 1000;
>
> looks a bit too weird.
>

Yes, it does look a bit weird, but that's the natural generalisation of
what we have for per-column statistics, so it's probably preferable to do
that rather than invent some other syntax that wouldn't be so consistent.

Regards,
Dean
Disconnect from SPI manager on error
A patch fixing this bug https://www.postgresql.org/message-id/flat/15738-21723084f3009ceb%40postgresql.org From 0144733c9f128108670f3654605f274928d83096 Mon Sep 17 00:00:00 2001 From: RekGRpth Date: Fri, 26 Apr 2019 15:35:30 +0500 Subject: Disconnect from SPI manager on error diff --git a/src/pl/plpgsql/src/pl_handler.c b/src/pl/plpgsql/src/pl_handler.c index 61452d9f7f..387b283b03 100644 --- a/src/pl/plpgsql/src/pl_handler.c +++ b/src/pl/plpgsql/src/pl_handler.c @@ -267,6 +267,13 @@ plpgsql_call_handler(PG_FUNCTION_ARGS) /* Decrement use-count, restore cur_estate, and propagate error */ func->use_count--; func->cur_estate = save_cur_estate; + + /* + * Disconnect from SPI manager + */ + if ((rc = SPI_finish()) != SPI_OK_FINISH) + elog(ERROR, "SPI_finish failed: %s", SPI_result_code_string(rc)); + PG_RE_THROW(); } PG_END_TRY(); @@ -364,6 +371,12 @@ plpgsql_inline_handler(PG_FUNCTION_ARGS) /* ... so we can free subsidiary storage */ plpgsql_free_function_memory(func); + /* + * Disconnect from SPI manager + */ + if ((rc = SPI_finish()) != SPI_OK_FINISH) + elog(ERROR, "SPI_finish failed: %s", SPI_result_code_string(rc)); + /* And propagate the error */ PG_RE_THROW(); }
unlogged sequences
The discussion in bug #15631 revealed that serial/identity sequences of temporary tables should really also be temporary (easy), and that serial/identity sequences of unlogged tables should also be unlogged. But there is no support for unlogged sequences, so I looked into that. If you copy the initial sequence relation file to the init fork, then this all seems to work out just fine. Attached is a patch. The low-level copying seems to be handled quite inconsistently across the code, so I'm not sure what the most appropriate way to do this would be. I'm looking for feedback from those who have worked on tableam and storage manager to see what the right interfaces are or whether some new interfaces might perhaps be appropriate. (What's still missing in this patch is ALTER SEQUENCE SET {LOGGED|UNLOGGED} as well as propagating the analogous ALTER TABLE command to owned sequences.) -- Peter Eisentraut http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services From c15d00d102a7dc2e13d7cc239b27cdcd189bac19 Mon Sep 17 00:00:00 2001 From: Peter Eisentraut Date: Sat, 4 May 2019 17:04:45 +0200 Subject: [PATCH v1 1/4] Make command order in test more sensible --- src/test/regress/expected/sequence.out | 2 +- src/test/regress/sql/sequence.sql | 4 ++-- 2 files changed, 3 insertions(+), 3 deletions(-) diff --git a/src/test/regress/expected/sequence.out b/src/test/regress/expected/sequence.out index a0d2b22d3c..75eb5015cf 100644 --- a/src/test/regress/expected/sequence.out +++ b/src/test/regress/expected/sequence.out @@ -599,7 +599,6 @@ DROP SEQUENCE seq2; -- should fail SELECT lastval(); ERROR: lastval is not yet defined in this session -CREATE USER regress_seq_user; -- Test sequences in read-only transactions CREATE TEMPORARY SEQUENCE sequence_test_temp1; START TRANSACTION READ ONLY; @@ -623,6 +622,7 @@ SELECT setval('sequence_test2', 1); -- error ERROR: cannot execute setval() in a read-only transaction ROLLBACK; -- privileges tests +CREATE USER regress_seq_user; -- nextval BEGIN; SET LOCAL SESSION AUTHORIZATION regress_seq_user; diff --git a/src/test/regress/sql/sequence.sql b/src/test/regress/sql/sequence.sql index a7b9e63372..7928ee23ee 100644 --- a/src/test/regress/sql/sequence.sql +++ b/src/test/regress/sql/sequence.sql @@ -272,8 +272,6 @@ CREATE SEQUENCE seq2; -- should fail SELECT lastval(); -CREATE USER regress_seq_user; - -- Test sequences in read-only transactions CREATE TEMPORARY SEQUENCE sequence_test_temp1; START TRANSACTION READ ONLY; @@ -287,6 +285,8 @@ CREATE TEMPORARY SEQUENCE sequence_test_temp1; -- privileges tests +CREATE USER regress_seq_user; + -- nextval BEGIN; SET LOCAL SESSION AUTHORIZATION regress_seq_user; base-commit: 414cca40d506dd3f17b49ae3139853139192c2ba -- 2.22.0 From a66fbef1fc05bd4d5907851bf93c4e268cd477ab Mon Sep 17 00:00:00 2001 From: Peter Eisentraut Date: Tue, 11 Jun 2019 16:40:41 +0200 Subject: [PATCH v1 2/4] Fix comment The last argument of smgrextend() was renamed from isTemp to skipFsync in debcec7dc31a992703911a9953e299c8d730c778, but the comments at two call sites were not updated. 
--- src/backend/access/heap/rewriteheap.c | 7 +++ src/backend/catalog/storage.c | 6 +++--- 2 files changed, 6 insertions(+), 7 deletions(-) diff --git a/src/backend/access/heap/rewriteheap.c b/src/backend/access/heap/rewriteheap.c index 369694fa2e..916231e2c4 100644 --- a/src/backend/access/heap/rewriteheap.c +++ b/src/backend/access/heap/rewriteheap.c @@ -703,10 +703,9 @@ raw_heap_insert(RewriteState state, HeapTuple tup) true); /* -* Now write the page. We say isTemp = true even if it's not a -* temp table, because there's no need for smgr to schedule an -* fsync for this write; we'll do it ourselves in -* end_heap_rewrite. +* Now write the page. We say skipFsync = true because there's no +* need for smgr to schedule an fsync for this write; we'll do it +* ourselves in end_heap_rewrite. */ RelationOpenSmgr(state->rs_new_rel); diff --git a/src/backend/catalog/storage.c b/src/backend/catalog/storage.c index 3cc886f7fe..d6112fa535 100644 --- a/src/backend/catalog/storage.c +++ b/src/backend/catalog/storage.c @@ -358,9 +358,9 @@ RelationCopyStorage(SMgrRelation src, SMgrRelation dst, PageSetChecksumInplace(page, blkno); /* -* Now write the page. We say isTemp = true even if it's not a temp -* rel, because there's no need for smgr to schedule an fsync for this -* write; we'll do it ourselves below. +* Now write the page. We say skipFsync = true because there's no +
Re: Implementing Incremental View Maintenance
Hi hackers,

Thank you for your many questions and feedback at PGCon 2019. Attached is the patch rebased for the current master branch.

Regards,
Yugo Nagata

On Tue, 14 May 2019 15:46:48 +0900 Yugo Nagata wrote:
> On Mon, 1 Apr 2019 12:11:22 +0900
> Yugo Nagata wrote:
>
> > On Thu, 27 Dec 2018 21:57:26 +0900
> > Yugo Nagata wrote:
> >
> > > Hi,
> > >
> > > I would like to implement Incremental View Maintenance (IVM) on
> > > PostgreSQL.
> >
> > I am now working on an initial patch for implementing IVM on PostgreSQL.
> > This enables materialized views to be updated incrementally after one
> > of their base tables is modified.
>
> Attached is a WIP patch of Incremental View Maintenance (IVM).
> The major part is written by me, and the changes in syntax and pg_class
> are Hoshiai-san's work.
>
> Although this is still a draft patch and a work in progress, any
> suggestions or thoughts would be appreciated.
>
> * What it is
>
> This allows a kind of Immediate Maintenance of materialized views. If a
> materialized view is created by the CREATE INCREMENTAL MATERIALIZED VIEW
> command, the contents of the matview are updated automatically and
> incrementally after the base tables are updated. Note that this syntax is
> just tentative, so it may be changed.
>
> == Example 1 ==
> postgres=# CREATE INCREMENTAL MATERIALIZED VIEW m AS SELECT * FROM t0;
> SELECT 3
> postgres=# SELECT * FROM m;
>  i
> ---
>  3
>  2
>  1
> (3 rows)
>
> postgres=# INSERT INTO t0 VALUES (4);
> INSERT 0 1
> postgres=# SELECT * FROM m; -- automatically updated
>  i
> ---
>  3
>  2
>  1
>  4
> (4 rows)
> =
>
> This implementation also supports matviews containing duplicate tuples or
> a DISTINCT clause in the view definition query. For example, even if a
> matview is defined with DISTINCT to remove duplication of tuples in a base
> table, this can perform an incremental update of the matview properly.
> That is, the contents of the matview don't change when existing tuples are
> inserted into the base tables, and a tuple in the matview is deleted only
> when the multiplicity of the corresponding tuple in the base table becomes
> zero.
>
> This is due to the "counting algorithm", in which the multiplicity of each
> tuple is stored in matviews as a special column value.
>
> == Example 2 ==
> postgres=# SELECT * FROM t1;
>  id | t
> ----+---
>   1 | A
>   2 | B
>   3 | C
>   4 | A
> (4 rows)
>
> postgres=# CREATE INCREMENTAL MATERIALIZED VIEW m1 AS SELECT t FROM t1;
> SELECT 3
> postgres=# CREATE INCREMENTAL MATERIALIZED VIEW m2 AS SELECT DISTINCT t FROM t1;
> SELECT 3
> postgres=# SELECT * FROM m1; -- with duplicates
>  t
> ---
>  A
>  A
>  C
>  B
> (4 rows)
>
> postgres=# SELECT * FROM m2;
>  t
> ---
>  A
>  B
>  C
> (3 rows)
>
> postgres=# INSERT INTO t1 VALUES (5, 'B');
> INSERT 0 1
> postgres=# DELETE FROM t1 WHERE id IN (1,3); -- delete (1,A),(3,C)
> DELETE 2
> postgres=# SELECT * FROM m1; -- one A left and one more B
>  t
> ---
>  B
>  B
>  A
> (3 rows)
>
> postgres=# SELECT * FROM m2; -- only C is removed
>  t
> ---
>  B
>  A
> (2 rows)
> =
>
> * How it works
>
> 1. Creating matview
>
> When a matview is created, AFTER triggers are internally created
> on its base tables. When the base tables are modified (INSERT, DELETE,
> UPDATE), the matview is updated incrementally in the trigger function.
>
> When populating the matview, GROUP BY and count(*) are added to the
> view definition query before it is executed, in order to count the
> multiplicity of tuples in the matview. The result of count(*) is stored
> in the matview as a special column named "__ivm_count__".
>
> 2. Maintenance of matview
>
> When base tables are modified, the change set of the table can be
> referred to as Ephemeral Named Relations (ENRs) thanks to Transition
> Tables (a trigger feature implemented since PG10). We can calculate the
> diff set of the matview by replacing the base table in the view
> definition query with the ENR (at least if it is a
> Selection-Projection-Join view). As at view definition time, GROUP BY and
> count(*) are added in order to count the multiplicity of tuples in the
> diff set. As a result, two diff sets (to be deleted from and to be
> inserted into the matview) are calculated, and the results are stored
> into temporary tables respectively.
>
> The matview is updated by merging these change sets. Instead of simply
> executing DELETE or INSERT, the value of the __ivm_count__ column in the
> matview is decreased or increased. When the value becomes zero, the
> corresponding tuple is deleted from the matview.
>
> 3. Access to matview
>
> When SELECT is issued for IVM matviews defined with DISTINCT, all columns
> except __ivm_count__ of each tuple in the matview are returned. This is
> correct because duplicates are eliminated by GROUP BY.
>
> When DISTINCT is not used, SELECT for the IVM matviews returns each tuple
> __ivm_count__ times. Currently, this is implemented by
Re: POC: Cleaning up orphaned files using undo logs
On Wed, Jun 19, 2019 at 11:35 PM Robert Haas wrote: > > On Tue, Jun 18, 2019 at 2:07 PM Robert Haas wrote: > > On Tue, Jun 18, 2019 at 7:31 AM Amit Kapila wrote: > > > [ new patches ] > > > > I tried writing some code [ to use these patches ]. > > I spent some more time experimenting with this patch set today and I > think that the UndoFetchRecord interface is far too zheap-centric. I > expected that I would be able to do this: > > UnpackedUndoRecord *uur = UndoFetchRecord(urp); > // do stuff with uur > UndoRecordRelease(uur); > > But I can't, because the UndoFetchRecord API requires me to pass not > only an undo record but also a block number, an offset number, an XID, > and a callback. I think I could get the effect that I want by > defining a callback that always returns true. Then I could do: > > UndoRecPtr junk; > UnpackedUndoRecord *uur = UndoFetchRecord(urp, InvalidBlockNumber, > InvalidOffsetNumber, &junk, always_returns_true); > // do stuff with uur > UndoRecordRelease(uur); > > That seems ridiculously baroque. I think the most common thing that > an AM will want to do with an UndoRecPtr is look up that exact record; > that is, for example, what zedstore will want to do. However, even if > some AMs, like zheap, want to search backward through a chain of > records, there's no real reason to suppose that all of them will want > to search by block number + offset. They might want to search by some > bit of data buried in the payload, for example. > > I think the basic question here is whether we really need anything > more complicated than: > > extern UnpackedUndoRecord *UndoFetchRecord(UndoRecPtr urp); > > I mean, if you had that, the caller can implement looping easily > enough, and insert any test they want: > > for (;;) > { > UnpackedUndoRecord *uur = UndoFetchRecord(urp); > if (i like this one) > break; > urp = uur->uur_blkprev; // should be renamed, since zedstore + > probably others will have tuple chains not block chains > UndoRecordRelease(uur); > } The idea behind having the loop inside the undo machinery was that while traversing the blkprev chain, we can read all the undo records on the same undo page under one buffer lock. > > The question in my mind is whether there's some performance advantage > of having the undo layer manage the looping rather than the caller do > it. If there is, then there's a lot of zheap code that ought to be > changed to use it, because it's just using the same satisfies-callback > everywhere. If there's not, we should just simplify the undo record > lookup along the lines mentioned above and put all the looping into > the callers. zheap could provide a wrapper around UndoFetchRecord > that does a search by block and offset, so that we don't have to > repeat that logic in multiple places. > > BTW, an actually generic iterator interface would probably look more like > this: > > typedef bool (*SatisfyUndoRecordCallback)(void *callback_data, > UnpackedUndoRecord *uur); Right, it should be this way. > extern UnpackedUndoRecord *UndoFetchRecord(UndoRecPtr urp, UndoRecPtr > *found, SatisfyUndoRecordCallback callback, void *callback_data); > > Now we're not assuming anything about what parts of the record the > callback wants to examine. It can do whatever it likes. Typically > with this sort of interface a caller will define a file-private struct > that is known to both the callback and the caller of UndoFetchRecord, > but not elsewhere. 
> > If we decide we need an iterator within the undo machinery itself, > then I think it should look like the above, and I think it should > accept NULL for found, callback, and callback_data, so that somebody > who wants to just look up a record, full stop, can do just: > > UnpackedUndoRecord *uur = UndoFetchRecord(urp, NULL, NULL, NULL); > > which seems tolerable. > I agree with this. -- Regards, Dilip Kumar EnterpriseDB: http://www.enterprisedb.com
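To make the proposed shape concrete, a caller-side sketch of the callback-based interface discussed above might look like the following. Only the UndoFetchRecord()/SatisfyUndoRecordCallback signatures are taken from the proposal; the struct, field, and function names here are invented for illustration (including the assumption that the unpacked record exposes an XID field):

typedef struct MyUndoLookupState
{
	TransactionId target_xid;	/* what this AM is searching the chain for */
	bool		found_it;
} MyUndoLookupState;

static bool
my_undo_record_matches(void *callback_data, UnpackedUndoRecord *uur)
{
	MyUndoLookupState *state = (MyUndoLookupState *) callback_data;

	/*
	 * Examine whatever parts of the record this AM cares about; the point
	 * of the interface is that only this callback needs to know how.
	 * (uur_xid is assumed here purely for the example.)
	 */
	if (TransactionIdEquals(uur->uur_xid, state->target_xid))
	{
		state->found_it = true;
		return true;			/* stop at this record */
	}
	return false;				/* keep following the chain */
}

	/* ... in the caller ... */
	MyUndoLookupState state = {.target_xid = xid, .found_it = false};
	UndoRecPtr	found_urp;
	UnpackedUndoRecord *uur;

	uur = UndoFetchRecord(start_urp, &found_urp,
						  my_undo_record_matches, &state);
	/* ... use uur, then UndoRecordRelease(uur) ... */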
Re: Minimal logical decoding on standbys
I am yet to work on Andres's latest detailed review comments, but I thought before that, I should submit a patch for the below reported issue because I was almost ready with the fix. Now I will start to work on Andres's comments, for which I will reply separately. On Fri, 1 Mar 2019 at 13:33, tushar wrote: > > Hi, > > While testing this feature found that - if lots of insert happened on > the master cluster then pg_recvlogical is not showing the DATA > information on logical replication slot which created on SLAVE. > > Please refer this scenario - > > 1) > Create a Master cluster with wal_level=logcal and create logical > replication slot - > SELECT * FROM pg_create_logical_replication_slot('master_slot', > 'test_decoding'); > > 2) > Create a Standby cluster using pg_basebackup ( ./pg_basebackup -D > slave/ -v -R) and create logical replication slot - > SELECT * FROM pg_create_logical_replication_slot('standby_slot', > 'test_decoding'); > > 3) > X terminal - start pg_recvlogical , provide port= ( slave > cluster) and specify slot=standby_slot > ./pg_recvlogical -d postgres -p -s 1 -F 1 -v --slot=standby_slot > --start -f - > > Y terminal - start pg_recvlogical , provide port=5432 ( master > cluster) and specify slot=master_slot > ./pg_recvlogical -d postgres -p 5432 -s 1 -F 1 -v --slot=master_slot > --start -f - > > Z terminal - run pg_bench against Master cluster ( ./pg_bench -i -s 10 > postgres) > > Able to see DATA information on Y terminal but not on X. > > but same able to see by firing this below query on SLAVE cluster - > > SELECT * FROM pg_logical_slot_get_changes('standby_slot', NULL, NULL); > > Is it expected ? Actually it shows up records after quite a long time. In general, walsender on standby is sending each record after significant time (1 sec), and pg_recvlogical shows all the inserted records only after the commit, so for huge inserts, it looks like it is hanging forever. In XLogSendLogical(), GetFlushRecPtr() was used to get the flushed point. On standby, GetFlushRecPtr() does not give a valid value, so it was wrongly determined that the sent record is beyond flush point, as a result of which, WalSndCaughtUp was set to true, causing WalSndLoop() to sleep for some duration after every record. This is why pg_recvlogical appears to be hanging forever in case of huge number of rows inserted. Fix : Use GetStandbyFlushRecPtr() if am_cascading_walsender. Attached patch v8. -- Thanks, -Amit Khandekar EnterpriseDB Corporation The Postgres Database Company logical-decoding-on-standby_v8.patch Description: Binary data
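In other words, the caught-up test in XLogSendLogical() presumably ends up looking roughly like the sketch below; the attached v8 patch is authoritative, and the variable names here are approximate:

	XLogRecPtr	flushPtr;

	/*
	 * On a standby, GetFlushRecPtr() does not return a usable value, so use
	 * the standby-aware flush position instead when deciding whether this
	 * walsender has caught up.
	 */
	if (am_cascading_walsender)
		flushPtr = GetStandbyFlushRecPtr();
	else
		flushPtr = GetFlushRecPtr();

	if (sentPtr >= flushPtr)
		WalSndCaughtUp = true;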
Re: POC: Cleaning up orphaned files using undo logs
On Thu, Jun 20, 2019 at 2:24 PM Dilip Kumar wrote: > > On Wed, Jun 19, 2019 at 11:35 PM Robert Haas wrote: > > > > On Tue, Jun 18, 2019 at 2:07 PM Robert Haas wrote: > > > On Tue, Jun 18, 2019 at 7:31 AM Amit Kapila > > > wrote: > > > > [ new patches ] > > > > > > I tried writing some code [ to use these patches ]. > > > > I spent some more time experimenting with this patch set today and I > > think that the UndoFetchRecord interface is far too zheap-centric. I > > expected that I would be able to do this: > > > > UnpackedUndoRecord *uur = UndoFetchRecord(urp); > > // do stuff with uur > > UndoRecordRelease(uur); > > > > But I can't, because the UndoFetchRecord API requires me to pass not > > only an undo record but also a block number, an offset number, an XID, > > and a callback. I think I could get the effect that I want by > > defining a callback that always returns true. Then I could do: > > > > UndoRecPtr junk; > > UnpackedUndoRecord *uur = UndoFetchRecord(urp, InvalidBlockNumber, > > InvalidOffsetNumber, &junk, always_returns_true); > > // do stuff with uur > > UndoRecordRelease(uur); > > > > That seems ridiculously baroque. I think the most common thing that > > an AM will want to do with an UndoRecPtr is look up that exact record; > > that is, for example, what zedstore will want to do. However, even if > > some AMs, like zheap, want to search backward through a chain of > > records, there's no real reason to suppose that all of them will want > > to search by block number + offset. They might want to search by some > > bit of data buried in the payload, for example. > > > > I think the basic question here is whether we really need anything > > more complicated than: > > > > extern UnpackedUndoRecord *UndoFetchRecord(UndoRecPtr urp); > > > > I mean, if you had that, the caller can implement looping easily > > enough, and insert any test they want: > > > > for (;;) > > { > > UnpackedUndoRecord *uur = UndoFetchRecord(urp); > > if (i like this one) > > break; > > urp = uur->uur_blkprev; // should be renamed, since zedstore + > > probably others will have tuple chains not block chains > > UndoRecordRelease(uur); > > } > > The idea behind having the loop inside the undo machinery was that > while traversing the blkprev chain, we can read all the undo records > on the same undo page under one buffer lock. > I think if we want we can hold this buffer and allow it to be released in UndoRecordRelease. However, this buffer needs to be stored in some common structure which can be then handed over to UndoRecordRelease. Another thing is that as of now the API allocates the memory just once for UnpackedUndoRecord whereas in the new scheme it needs to be allocated again and again. I think this is a relatively minor thing, but it might be better if we can avoid palloc again and again. BTW, while looking at the code of UndoFetchRecord, I see some problem. There is a coding pattern like if() { } else { LWLockAcquire() .. .. } LWLockRelease(). I think this is not correct. > > > > BTW, an actually generic iterator interface would probably look more like > > this: > > > > typedef bool (*SatisfyUndoRecordCallback)(void *callback_data, > > UnpackedUndoRecord *uur); > Right, it should be this way. > > > extern UnpackedUndoRecord *UndoFetchRecord(UndoRecPtr urp, UndoRecPtr > > *found, SatisfyUndoRecordCallback callback, void *callback_data); > > > > Now we're not assuming anything about what parts of the record the > > callback wants to examine. It can do whatever it likes. 
Typically > > with this sort of interface a caller will define a file-private struct > > that is known to both the callback and the caller of UndoFetchRecord, > > but not elsewhere. > > > > If we decide we need an iterator within the undo machinery itself, > > then I think it should look like the above, and I think it should > > accept NULL for found, callback, and callback_data, so that somebody > > who wants to just look up a record, full stop, can do just: > > > > UnpackedUndoRecord *uur = UndoFetchRecord(urp, NULL, NULL, NULL); > > > > which seems tolerable. > > > I agree with this. > +1. -- With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com
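For readers skimming the thread, the locking shape being flagged above is roughly the following (SomeUndoLock and the condition are placeholders, not names from the patch): the release is unconditional while the acquire happens on only one branch, so one path releases a lock it never took. The acquire and release need to live on the same branch:

	/* problematic shape */
	if (skip_lock_case)
	{
		/* ... read the state without the lock ... */
	}
	else
	{
		LWLockAcquire(SomeUndoLock, LW_SHARED);
		/* ... read the shared state ... */
	}
	LWLockRelease(SomeUndoLock);	/* wrong: not held on the first branch */

	/* corrected shape */
	if (skip_lock_case)
	{
		/* ... read the state without the lock ... */
	}
	else
	{
		LWLockAcquire(SomeUndoLock, LW_SHARED);
		/* ... read the shared state ... */
		LWLockRelease(SomeUndoLock);
	}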
Re: POC: Cleaning up orphaned files using undo logs
On Wed, Jun 19, 2019 at 11:35 PM Robert Haas wrote: > > On Tue, Jun 18, 2019 at 2:07 PM Robert Haas wrote: > > On Tue, Jun 18, 2019 at 7:31 AM Amit Kapila wrote: > > > [ new patches ] > > > > I tried writing some code [ to use these patches ]. > > > for (;;) > { > UnpackedUndoRecord *uur = UndoFetchRecord(urp); > if (i like this one) > break; > urp = uur->uur_blkprev; // should be renamed, since zedstore + > probably others will have tuple chains not block chains .. +1 for renaming this variable. How about uur_prev_ver or uur_prevver or uur_verprev? Any other suggestions? -- With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com
Re: [Proposal] Table-level Transparent Data Encryption (TDE) and Key Management Service (KMS)
On Tue, Jun 18, 2019 at 5:07 PM shawn wang wrote: > > Masahiko Sawada 于2019年6月17日周一 下午8:30写道: >> >> On Fri, Jun 14, 2019 at 7:41 AM Tomas Vondra >> wrote: >> > I personally find the idea of encrypting tablespaces rather strange. >> > Tablespaces are meant to define hysical location for objects, but this >> > would also use them to "mark" objects as encrypted or not. That just >> > seems misguided and would make the life harder for many users. >> > >> > For example, what if I don't have any tablespaces (except for the >> > default one), but I want to encrypt only some objects? Suddenly I have >> > to create a tablespace, which will however cause various difficulties >> > down the road (during pg_basebackup, etc.). >> >> I guess that we can have an encrypted tabelspace by default (e.g. >> pg_default_enc). Or we encrypt per tables while having encryption keys >> per tablespaces. > > > Hi Sawada-san, > I do agree with it. >> >> >> On Mon, Jun 17, 2019 at 6:54 AM Tomas Vondra >> wrote: >> > >> > On Sun, Jun 16, 2019 at 02:10:23PM -0400, Stephen Frost wrote: >> > >Greetings, >> > > >> > >* Joe Conway (m...@joeconway.com) wrote: >> > >> On 6/16/19 9:45 AM, Bruce Momjian wrote: >> > >> > On Sun, Jun 16, 2019 at 07:07:20AM -0400, Joe Conway wrote: >> > >> >> In any case it doesn't address my first point, which is limiting the >> > >> >> volume encrypted with the same key. Another valid reason is you might >> > >> >> have data at varying sensitivity levels and prefer different keys be >> > >> >> used for each level. >> > >> > >> > >> > That seems quite complex. >> > >> >> > >> How? It is no more complex than encrypting at the tablespace level >> > >> already gives you - in that case you get this property for free if you >> > >> care to use it. >> > > >> > >Perhaps not surprising, but I'm definitely in agreement with Joe >> > >regarding having multiple keys when possible and (reasonably) >> > >straight-forward to do so. I also don't buy off on the OpenSSL >> > >argument; their more severe issues certainly haven't been due to key >> > >management issues such as what we're discussing here, so I don't think >> > >the argument applies. >> > > >> > >> > I'm not sure what exactly is the "OpenSSL argument" you're disagreeing >> > with? IMHO Bruce is quite right that the risk of vulnerabilities grows >> > with the complexity of the system (both due to implementation bugs and >> > general design weaknesses). I don't think it's tied to the key >> > management specifically, except that it's one of the parts that may >> > contribute to the complexity. >> > >> > (It's often claimed that key management is one of the weakest points of >> > current crypto systems - we have safe (a)symmetric algorithms, but safe >> > handling of keys is an issue. I don't have data / papers supporting this >> > claim, I kinda believe it.) >> > >> > Now, I'm not opposed to eventually implementing something more >> > elaborate, but I also think just encrypting the whole cluster (not >> > necessarily with a single key, but with one master key) would be enough >> > for vast majority of users. Plus it's less error prone and easier to >> > operate (backups, replicas, crash recovery, ...). >> > >> > But there's about 0% chance we'll get that in v1, of course, so we need >> > s "minimum viable product" to build on anyway. >> > >> >> I agree that we need minimum viable product first. But I'm not sure >> it's true that the implementing the cluster-wide TDE first could be >> the first step of per-tablespace/table TDE. 
>
> Yes, we could complete the per-tablespace/table TDE in version 13.
> And we could do cluster-wide TDE in the next version.
> But I remember you said there are so many keys to manage in the table-level.

I think even if we provide per-table encryption we can have encryption keys per tablespace. That is, all tables in the same tablespace use the same encryption key, but the user can control which objects are encrypted per table.

> Will we add the table-level TDE in the first version?

I hope so, but it's under discussion now.

> And I have two questions.
> 1. Will we add hooks to support replacing the encryption algorithms?
> 2. Will we add some encryption algorithm or use some in some libraries?

Currently the WIP patch uses OpenSSL and supports only AES-256, and I don't have a plan for such extensibility for now. But it might be a good idea in the future. I think it would not be hard to support the symmetric encryption algorithms provided by OpenSSL, but would you like to support other encryption algorithms?

Regards,

--
Masahiko Sawada
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center
Re: UCT (Re: pgsql: Update time zone data files to tzdata release 2019a.)
Greetings, * Tom Lane (t...@sss.pgh.pa.us) wrote: > Also on the topic of process: 48 hours before a wrap deadline is > *particularly* not the time to play fast and loose with this sort of > thing. It'd have been better to wait till after this week's releases, > so there'd at least be time to reconsider if the patch turned out to > have unexpected side-effects. Our typical process for changes that actually end up breaking other things is to put things back the way they were and come up with a better answer. Should we have reverted the code change that caused the issue in the first place, namely, as I understand it at least, the tz code update, to give us time to come up with a better solution and to fix it properly? I'll admit that I wasn't following the thread very closely initially, but I don't recall seeing that even discussed as an option, even though we do it routinely and even had another such case for this set of releases. Possibly a bad assumption on my part, but I did assume that the lack of such a discussion meant that reverting wasn't really an option due to the nature of the changes, leading us into an atypical case already where our usual processes weren't able to be followed. That doesn't mean we should throw the whole thing out the window either, certainly, but I'm not sure that between the 3 options of 'revert', 'live with things being arguably broken', and 'push a contentious commit' that I'd have seen a better option either. I do agree that it would have been better if intentions had been made clearer, such as announcing the plan to push the changes so that we didn't end up with an issue during this patch set (either from out of date zone information, or from having the wrong timezone alias be used), but also with feelings on both sides- if there had been a more explicit "hey, we really need input from someone else on which way they think this should go" ideally with the options spelled out, it would have helped. I don't want to come across as implying that I'm saying what was done was 'fine', or that we shouldn't be having this conversation, I'm just trying to figure out how we can frame it in a way that we learn from it and work to improve on it for the future, should something like this happen again. Thanks, Stephen signature.asc Description: PGP signature
Re: Index Skip Scan
Hi, On 6/19/19 9:57 AM, Dmitry Dolgov wrote: Here is what I was talking about, POC for an integration with index scan. About using of create_skipscan_unique_path and suggested planner improvements, I hope together with Jesper we can come up with something soon. I made some minor changes, but I did move all the code in create_distinct_paths() under enable_indexskipscan to limit the overhead if skip scan isn't enabled. Attached is v20, since the last patch should have been v19. Best regards, Jesper >From 4fd4bd601f510ccce858196c0e93d37aa2d0f20f Mon Sep 17 00:00:00 2001 From: jesperpedersen Date: Thu, 20 Jun 2019 07:42:24 -0400 Subject: [PATCH 1/2] Implementation of Index Skip Scan (see Loose Index Scan in the wiki [1]) on top of IndexScan and IndexOnlyScan. To make it suitable for both situations when there are small number of distinct values and significant amount of distinct values the following approach is taken - instead of searching from the root for every value we're searching for then first on the current page, and then if not found continue searching from the root. Original patch and design were proposed by Thomas Munro [2], revived and improved by Dmitry Dolgov and Jesper Pedersen. [1] https://wiki.postgresql.org/wiki/Loose_indexscan [2] https://www.postgresql.org/message-id/flat/CADLWmXXbTSBxP-MzJuPAYSsL_2f0iPm5VWPbCvDbVvfX93FKkw%40mail.gmail.com --- contrib/bloom/blutils.c | 1 + doc/src/sgml/config.sgml | 16 ++ doc/src/sgml/indexam.sgml | 10 + doc/src/sgml/indices.sgml | 24 ++ src/backend/access/brin/brin.c| 1 + src/backend/access/gin/ginutil.c | 1 + src/backend/access/gist/gist.c| 1 + src/backend/access/hash/hash.c| 1 + src/backend/access/index/indexam.c| 16 ++ src/backend/access/nbtree/nbtree.c| 12 + src/backend/access/nbtree/nbtsearch.c | 224 +- src/backend/access/spgist/spgutils.c | 1 + src/backend/commands/explain.c| 24 ++ src/backend/executor/nodeIndexonlyscan.c | 22 ++ src/backend/executor/nodeIndexscan.c | 22 ++ src/backend/nodes/copyfuncs.c | 2 + src/backend/nodes/outfuncs.c | 3 + src/backend/nodes/readfuncs.c | 2 + src/backend/optimizer/path/costsize.c | 1 + src/backend/optimizer/path/pathkeys.c | 84 ++- src/backend/optimizer/plan/createplan.c | 20 +- src/backend/optimizer/plan/planagg.c | 1 + src/backend/optimizer/plan/planner.c | 79 +- src/backend/optimizer/util/pathnode.c | 40 src/backend/optimizer/util/plancat.c | 1 + src/backend/utils/misc/guc.c | 9 + src/backend/utils/misc/postgresql.conf.sample | 1 + src/include/access/amapi.h| 5 + src/include/access/genam.h| 1 + src/include/access/nbtree.h | 5 + src/include/nodes/execnodes.h | 7 + src/include/nodes/pathnodes.h | 8 + src/include/nodes/plannodes.h | 2 + src/include/optimizer/cost.h | 1 + src/include/optimizer/pathnode.h | 5 + src/include/optimizer/paths.h | 4 + src/test/regress/expected/create_index.out| 1 + src/test/regress/expected/select_distinct.out | 209 src/test/regress/expected/sysviews.out| 3 +- src/test/regress/sql/create_index.sql | 2 + src/test/regress/sql/select_distinct.sql | 68 ++ 41 files changed, 918 insertions(+), 22 deletions(-) diff --git a/contrib/bloom/blutils.c b/contrib/bloom/blutils.c index ee3bd56274..a88b730f2e 100644 --- a/contrib/bloom/blutils.c +++ b/contrib/bloom/blutils.c @@ -129,6 +129,7 @@ blhandler(PG_FUNCTION_ARGS) amroutine->ambulkdelete = blbulkdelete; amroutine->amvacuumcleanup = blvacuumcleanup; amroutine->amcanreturn = NULL; + amroutine->amskip = NULL; amroutine->amcostestimate = blcostestimate; amroutine->amoptions = bloptions; amroutine->amproperty = NULL; diff --git 
a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml index 84341a30e5..a1c8a1ea27 100644 --- a/doc/src/sgml/config.sgml +++ b/doc/src/sgml/config.sgml @@ -4400,6 +4400,22 @@ ANY num_sync ( + enable_indexskipscan (boolean) + + enable_indexskipscan configuration parameter + + + + +Enables or disables the query planner's use of index-skip-scan plan +types (see ). This parameter requires +that enable_indexonlyscan is on. +The default is on. + + + + enable_material (boolean) diff --git a/doc/src/sgml/indexam.sgml b/doc/src/sgml/indexam.sgml index dd54c68802..c2eb296306 100644 --- a/doc/src/sgml/indexam.sgml +++ b/doc/src/sgml/
old_snapshot_threshold vs indexes
Hello,

I ran into someone with a system where big queries scanning 8GB+ of all-in-cache data took consistently ~2.5x longer on a primary server than on a replica. Both servers had concurrent activity on them but plenty of spare capacity and similar specs. After some investigation it turned out that on the primary there were (1) some select() syscalls waiting for 1ms, which might indicate contended SpinLockAcquire() back-offs, and (2) a huge amount of time spent in:

+   93,31%  0,00%  postgres  postgres  [.] index_getnext
+   93,30%  0,00%  postgres  postgres  [.] index_fetch_heap
+   81,66%  0,01%  postgres  postgres  [.] heap_page_prune_opt
+   75,85%  0,00%  postgres  postgres  [.] TransactionIdLimitedForOldSnapshots
+   75,83%  0,01%  postgres  postgres  [.] RelationHasUnloggedIndex
+   75,79%  0,00%  postgres  postgres  [.] RelationGetIndexList
+   75,79% 75,78%  postgres  postgres  [.] list_copy

The large tables in question have around 30 indexes. I see that heap_page_prune_opt()'s call to TransactionIdLimitedForOldSnapshots() acquires a couple of system-wide spinlocks, and also tests RelationAllowsEarlyPruning(), which calls RelationHasUnloggedIndex(), which says:

 * Tells whether any index for the relation is unlogged.
 *
 * Note: There doesn't seem to be any way to have an unlogged index attached
 * to a permanent table, but it seems best to keep this general so that it
 * returns sensible results even when they seem obvious (like for an unlogged
 * table) and to handle possible future unlogged indexes on permanent tables.

It calls RelationGetIndexList(), which conses up a new copy of the list every time, so that we can spin through it looking for unlogged indexes (and in this user's case there are none).

I didn't try to poke at this in lab conditions, but from a glance at the code, I guess heap_page_prune_opt() is running for every index tuple except those that reference the same heap page as the one before, so I guess it happens a lot if the heap is not physically correlated with the index keys. Ouch.

--
Thomas Munro
https://enterprisedb.com
Re: check_recovery_target_lsn() does a PG_CATCH without a throw
On 2019-06-12 13:16, Peter Eisentraut wrote: > I haven't figured out the time zone issue yet, but I guess the solution > might involve moving some of the code from check_recovery_target_time() > to assign_recovery_target_time(). I think that won't work either. What we need to do is postpone the interpretation of the timestamp string until after all the GUC processing is done. So check_recovery_target_time() would just do some basic parsing checks, but stores the string. Then when we need the recovery_target_time_value we do the final parsing. Then we can be sure that the time zone is all set. -- Peter Eisentraut http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Re: [Proposal] Table-level Transparent Data Encryption (TDE) and Key Management Service (KMS)
On 6/20/19 8:34 AM, Masahiko Sawada wrote: > I think even if we provide the per table encryption we can have > encryption keys per tablepaces. That is, all tables on the same > tablespace encryption use the same encryption keys but user can > control encrypted objects per tables. > >> Will we add the table-level TDE in the first version? > > I hope so but It's under discussion now. +1 >> And I have two questions. >> 1. Will we add hooks to support replacing the encryption algorithms? >> 2. Will we add some encryption algorithm or use some in some libraries? > > Currently the WIP patch uses openssl and supports only AES-256 and I > don't have a plan to have such extensibility for now. But it might be > a good idea in the future. I think it would be not hard to support > symmetric encryption altgorithms supported by openssl but would you > like to support other encryption algorithms? Supporting other symmetric encryption algorithms would be nice but I don't think that is required for the first version. It would also be nice but not initially required to support different encryption libraries. The implementation should be written with both of these eventualities in mind though IMHO. I would like to see strategically placed hooks in the key management so that an extension could, for example, layer another key in between the master key and the per-tablespace key. This would allow extensions to provide additional control over when decryption is allowed. Joe -- Crunchy Data - http://crunchydata.com PostgreSQL Support for Secure Enterprises Consulting, Training, & Open Source Development
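To illustrate what "strategically placed hooks" could mean in practice, here is a purely hypothetical sketch following the usual PostgreSQL hook convention. None of these names exist in any posted patch; they only show how an extension could interpose an extra key layer between the master key and a per-tablespace key:

/* hypothetical hook type for unwrapping/deriving a per-tablespace key */
typedef void (*kmgr_unwrap_key_hook_type) (Oid tablespaceoid,
										   const uint8 *wrapped_key,
										   Size wrapped_len,
										   uint8 *out_key,
										   Size out_len);

extern PGDLLIMPORT kmgr_unwrap_key_hook_type kmgr_unwrap_key_hook;

/* at the point where core would derive the per-tablespace key */
if (kmgr_unwrap_key_hook)
	(*kmgr_unwrap_key_hook) (tablespaceoid, wrapped, wrappedlen, key, keylen);
else
	kmgr_default_unwrap_key(tablespaceoid, wrapped, wrappedlen, key, keylen);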
Re: POC: Cleaning up orphaned files using undo logs
On Thu, Jun 20, 2019 at 2:42 AM Amit Kapila wrote: > Okay, one reason that comes to mind is we don't want to choke the > system as applying undo can consume CPU and generate a lot of I/O. Is > that you have in mind or something else? Yeah, mainly that, but also things like log spam, and even pressure on the lock table. If we are trying over and over again to take useless locks, it can affect other things on the system. The main thing, however, is the CPU and I/O consumption. > I see an advantage in having some sort of throttling here, so we can > have some wait time (say 100ms) between processing requests. Do we > see any need of guc here? I don't think that is the right approach. As I said in my previous reply, we need a way of holding off the retry of the same error for a certain amount of time, probably measured in seconds or tens of seconds. Introducing a delay before processing every request is an inferior alternative: if there are a lot of rollbacks, it can cause the system to lag; and in the case where there's just one rollback that's failing, it will still be way too much log spam (and probably CPU time too). Nobody wants 10 failure messages per second in the log. > > It seems to me that thinking of this in terms of what the undo worker > > does and what the undo launcher does is probably not the right > > approach. We need to think of it more as an integrated system. Instead > > of storing a failure_count with each request in the error queue, how > > about storing a next retry time? > > I think both failure_count and next_retry_time can work in a similar way. > > I think incrementing next retry time in multiples will be a bit > tricky. Say first-time error occurs at X hours. We can say that > next_retry_time will X+10s=Y and error_occured_at will be X. The > second time it again failed, how will we know that we need set > next_retry_time as Y+20s, maybe we can do something like Y-X and then > add 10s to it and add the result to the current time. Now whenever > the worker or launcher finds this request, they can check if the > current_time is greater than or equal to next_retry_time, if so they > can pick that request, otherwise, they check request in next queue. > > The failure_count can also work in a somewhat similar fashion. > Basically, we can use error_occurred at and failure_count to compute > the required time. So, if error is occurred at say X hours and > failure count is 3, then we can check if current_time is greater than > X+(3 * 10s), then we will allow the entry to be processed, otherwise, > it will check other queues for work. Meh. Don't get stuck on one particular method of calculating the next retry time. We want to be able to change that easily if whatever we try first doesn't work out well. I am not convinced that we need anything more complex than a fixed retry time, probably controlled by a GUC (undo_failure_retry_time = 10s?). An escalating time between retries would be important and advantageous if we expected the sizes of these queues to grow into the millions, but the current design seems to be contemplating something more in the tends-of-thousands range and I am not sure we're going to need it at that level. We should try simple things first and then see where we need to make it more complex. At some basic level, the queue needs to be ordered by increasing retry time. You can do that with your design, but you have to recompute the next retry time from the error_occurred_at and failure_count values every time you examine an entry. 
It's almost certainly better to store the next_retry_time explicitly. That way, if for example we change the logic for computing the next_retry_time to something really complicated, it doesn't have any effect on the code that keeps the queue in order -- it just looks at the computed value. If we end up with something very simple, like error_occurred_at + constant, it may end up seeming a little silly, but I think that's a price well worth paying for code maintainability. If we end up with error_occurred_at + Min(failure_count * 10, 100) or something of that sort, then we can also store failure_count in each record, but it will just be part of the payload, not the sort key, so adding it or removing it won't affect the code that maintains the queue ordering. > > I think the error queue needs to be > > ordered by database_id, then by next_retry_time, and then by order of > > insertion. (The last part is important because next_retry_time is > > going to be prone to having ties, and we need to break those ties in > > the right way.) > > I think it makes sense to order requests by next_retry_time, > error_occurred_at (this will ensure the order of insertion). However, > I am not sure if there is a need to club the requests w.r.t database > id. It can starve the error requests from other databases. Moreover, > we already have a functionality wherein if the undo worker doesn't > encounter the next request from the same data
benchmarking Flex practices
I decided to do some experiments with how we use Flex. The main takeaway is that backtracking, which we removed in 2005, doesn't seem to matter anymore for the core scanner. Also, state table size is of marginal importance. Using the information_schema Flex+Bison microbenchmark from Tom [1], I tested removing most of the "fail" rules designed to avoid backtracking ("decimalfail" is needed by PL/pgSQL). Below are the best times (most runs within 1%), followed by postgres binary size. The numbers are with Flex 2.5.35 on MacOS, no asserts or debugging symbols. HEAD: 1.53s 7139132 bytes HEAD minus "fail" rules (patch attached): 1.53s 6971204 bytes Surprisingly, it has the same performance and a much smaller binary. The size difference is because the size of the elements of the yy_transition array is constrained by the number of elements in the array. Since there are now fewer than INT16_MAX state transitions, the struct members go from 32 bit: struct yy_trans_info { flex_int32_t yy_verify; flex_int32_t yy_nxt; }; static yyconst struct yy_trans_info yy_transition[37045] = ... to 16 bit: struct yy_trans_info { flex_int16_t yy_verify; flex_int16_t yy_nxt; }; static yyconst struct yy_trans_info yy_transition[31763] = ... To test if array size was the deciding factor, I tried bloating it by essentially undoing commit a5ff502fcea. Doing so produced an array with 62583 elements and 32-bit members, so nearly quadruple in size, and it was still not much slower than HEAD: HEAD minus "fail" rules, minus %xusend/%xuiend: 1.56s 7343932 bytes While at it, I repeated the benchmark with different Flex flags: HEAD, plus -Cf: 1.60s 6995788 bytes HEAD, minus "fail" rules, plus -Cf: 1.59s 6979396 bytes HEAD, plus -Cfe: 1.65s 6868804 bytes So this recommendation of the Flex manual (-CF) still holds true. It's worth noting that using perfect hashing for keyword lookup (20% faster) had a much bigger effect than switching from -Cfe to -CF (7% faster). It would be nice to have confirmation to make sure I didn't err somewhere, and to try a more real-world benchmark. (Also for the moment I only have Linux on a virtual machine.) The regression tests pass, but some comments are now wrong. If it's confirmed that backtracking doesn't matter for recent Flex/hardware, disregarding it would make maintenance of our scanners a bit easier. [1] https://www.postgresql.org/message-id/14616.1558560331%40sss.pgh.pa.us -- John Naylorhttps://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services remove-scanner-fail-rules.patch Description: Binary data
Re: POC: Cleaning up orphaned files using undo logs
On Thu, Jun 20, 2019 at 6:48 AM Amit Kapila wrote: > > for (;;) > > { > > UnpackedUndoRecord *uur = UndoFetchRecord(urp); > > if (i like this one) > > break; > > urp = uur->uur_blkprev; // should be renamed, since zedstore + > > probably others will have tuple chains not block chains > .. > > +1 for renaming this variable. How about uur_prev_ver or uur_prevver > or uur_verprev? Any other suggestions? Maybe just uur_previous or uur_prevundo or something like that. We've already got a uur_prevurp, but that's really pretty misnamed and IMHO it doesn't belong in this structure anyway. (uur_next is also a bad name and also doesn't belong in this structure.) I don't think we want to use 'ver' because that supposes that undo is being used to track tuple versions, which is a likely use but perhaps not the only one. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: benchmarking Flex practices
John Naylor writes: > I decided to do some experiments with how we use Flex. The main > takeaway is that backtracking, which we removed in 2005, doesn't seem > to matter anymore for the core scanner. Also, state table size is of > marginal importance. Huh. That's really interesting, because removing backtracking was a demonstrable, significant win when we did it [1]. I wonder what has changed? I'd be prepared to believe that today's machines are more sensitive to the amount of cache space eaten by the tables --- but that idea seems contradicted by your result that the table size isn't important. (I'm wishing I'd documented the test case I used in 2005...) > The size difference is because the size of the elements of the > yy_transition array is constrained by the number of elements in the > array. Since there are now fewer than INT16_MAX state transitions, the > struct members go from 32 bit: > static yyconst struct yy_trans_info yy_transition[37045] = ... > to 16 bit: > static yyconst struct yy_trans_info yy_transition[31763] = ... Hm. Smaller binary is definitely nice, but 31763 is close enough to 32768 that I'd have little faith in the optimization surviving for long. Is there any way we could buy back some more transitions? > It would be nice to have confirmation to make sure I didn't err > somewhere, and to try a more real-world benchmark. I don't see much wrong with using information_schema.sql as a parser/lexer benchmark case. We should try to confirm the results on other platforms though. regards, tom lane [1] https://www.postgresql.org/message-id/8652.1116865...@sss.pgh.pa.us
Re: improve transparency of bitmap-only heap scans
> Looking at the discussion where the feature was added, I think changing the
> EXPLAIN just wasn't considered.

I think this is an oversight. It is very useful to have this in EXPLAIN.

> The attached patch adds "avoided" to "exact" and "lossy" as a category
> under "Heap Blocks".

It took me a while to figure out what those names mean. "unfetched", as you call it in the code, may be more descriptive than "avoided" for the new label. However, I think the other two are more confusing. It may be a good idea to change them together with this.

> I think the name of the node should also be changed to "Bitmap Only Heap
> Scan", but I didn't implement that as adding another NodeTag looks like a
> lot of tedious, error-prone work to do before getting feedback on whether
> the change is desirable in the first place, or the correct approach.

I am not sure about this part. In my opinion, it might have been easier to explain to users if "Index Only Scan" had not been a separate node but an optimization of "Index Scan".
Re: POC: Cleaning up orphaned files using undo logs
On Thu, Jun 20, 2019 at 6:44 AM Amit Kapila wrote: > BTW, while looking at the code of UndoFetchRecord, I see some problem. > There is a coding pattern like > if() > { > } > else > { >LWLockAcquire() > .. > .. > } > > LWLockRelease(). > > I think this is not correct. Independently of that problem, I think it's probably bad that we're not maintaining the same shared memory state on the master and the standby. Doing the same check in one way on the master and in a different way on the standby is a recipe for surprising and probably bad behavior differences between master and standby servers. Those could be simple things like lock acquire/release not matching, but they could also be things like performance or correctness differences that only materialize under certain scenarios. This is not the only place in the patch set where we have this kind of thing, and I hate them all. I don't exactly know what the solution is, either, but I suspect it will involve either having the recovery process do a more thorough job updating the shared memory state when it does undo-related stuff, or running some of the undo-specific processes on the standby just for the purpose of getting these updates done. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: Disconnect from SPI manager on error
RekGRpth writes: > A patch fixing this bug > https://www.postgresql.org/message-id/flat/15738-21723084f3009ceb%40postgresql.org I do not think this code change is necessary or appropriate. It is not plpgsql's job to clean up after other backend subsystems during a transaction abort. Maybe if plpgsql were the only thing that invokes spi.c, it would be sane to factorize the responsibility this way --- but of course it is not. The complaint in bug #15738 is 100% bogus, which is probably why it was roundly ignored. The quoted C code is just plain wrong about how to handle errors inside the backend. In particular, SPI_rollback is not even approximately the right thing to do to clean up after catching a thrown exception. regards, tom lane
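For reference, the established way to trap an error around SPI calls in backend C code is the internal-subtransaction pattern that plpgsql itself uses for EXCEPTION blocks, not calling SPI_rollback from a PG_CATCH block. A simplified sketch (details trimmed):

	MemoryContext oldcontext = CurrentMemoryContext;
	ResourceOwner oldowner = CurrentResourceOwner;

	BeginInternalSubTransaction(NULL);
	/* keep running in the caller's memory context */
	MemoryContextSwitchTo(oldcontext);

	PG_TRY();
	{
		int			rc = SPI_execute("SELECT 1", true, 0);

		if (rc < 0)
			elog(ERROR, "SPI_execute failed: %s", SPI_result_code_string(rc));

		/* success: commit the subtransaction */
		ReleaseCurrentSubTransaction();
		MemoryContextSwitchTo(oldcontext);
		CurrentResourceOwner = oldowner;
	}
	PG_CATCH();
	{
		ErrorData  *edata;

		/* save the error data, then clean up the aborted subtransaction */
		MemoryContextSwitchTo(oldcontext);
		edata = CopyErrorData();
		FlushErrorState();

		RollbackAndReleaseCurrentSubTransaction();
		MemoryContextSwitchTo(oldcontext);
		CurrentResourceOwner = oldowner;

		/* now decide: rethrow, log, or translate the error */
		ReThrowError(edata);
	}
	PG_END_TRY();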
Re: POC: Cleaning up orphaned files using undo logs
On Thu, Jun 20, 2019 at 8:01 PM Robert Haas wrote: > > On Thu, Jun 20, 2019 at 2:42 AM Amit Kapila wrote: > > Okay, one reason that comes to mind is we don't want to choke the > > system as applying undo can consume CPU and generate a lot of I/O. Is > > that you have in mind or something else? > > Yeah, mainly that, but also things like log spam, and even pressure on > the lock table. If we are trying over and over again to take useless > locks, it can affect other things on the system. The main thing, > however, is the CPU and I/O consumption. > > > I see an advantage in having some sort of throttling here, so we can > > have some wait time (say 100ms) between processing requests. Do we > > see any need of guc here? > > I don't think that is the right approach. As I said in my previous > reply, we need a way of holding off the retry of the same error for a > certain amount of time, probably measured in seconds or tens of > seconds. Introducing a delay before processing every request is an > inferior alternative: > This delay is for *not* choking the system by constantly performing undo requests that consume a lot of CPU and I/O as discussed in above point. For holding off the same error request to be re-tried, we need next_retry_time type of method as discussed below. if there are a lot of rollbacks, it can cause > the system to lag; and in the case where there's just one rollback > that's failing, it will still be way too much log spam (and probably > CPU time too). Nobody wants 10 failure messages per second in the > log. > > > > It seems to me that thinking of this in terms of what the undo worker > > > does and what the undo launcher does is probably not the right > > > approach. We need to think of it more as an integrated system. Instead > > > of storing a failure_count with each request in the error queue, how > > > about storing a next retry time? > > > > I think both failure_count and next_retry_time can work in a similar way. > > > > I think incrementing next retry time in multiples will be a bit > > tricky. Say first-time error occurs at X hours. We can say that > > next_retry_time will X+10s=Y and error_occured_at will be X. The > > second time it again failed, how will we know that we need set > > next_retry_time as Y+20s, maybe we can do something like Y-X and then > > add 10s to it and add the result to the current time. Now whenever > > the worker or launcher finds this request, they can check if the > > current_time is greater than or equal to next_retry_time, if so they > > can pick that request, otherwise, they check request in next queue. > > > > The failure_count can also work in a somewhat similar fashion. > > Basically, we can use error_occurred at and failure_count to compute > > the required time. So, if error is occurred at say X hours and > > failure count is 3, then we can check if current_time is greater than > > X+(3 * 10s), then we will allow the entry to be processed, otherwise, > > it will check other queues for work. > > Meh. Don't get stuck on one particular method of calculating the next > retry time. We want to be able to change that easily if whatever we > try first doesn't work out well. I am not convinced that we need > anything more complex than a fixed retry time, probably controlled by > a GUC (undo_failure_retry_time = 10s?). > IIRC, then you only seem to have suggested that we need a kind of back-off algorithm that gradually increases the retry time up to some maximum [1]. 
I think that is a good way to de-prioritize requests that are repeatedly failing. Say, there is a request that has already failed for 5 times and the worker queues it to get executed after 10s. Immediately after that, another new request has failed for the first time for the same database and it also got queued to get executed after 10s. In this scheme the request that has already failed for 5 times will get a chance before the request that has failed for the first time. [1] - https://www.postgresql.org/message-id/CA%2BTgmoYHBkm7M8tNk6Z9G_aEOiw3Bjdux7v9%2BUzmdNTdFmFzjA%40mail.gmail.com -- With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com
Re: benchmarking Flex practices
Hi, On 2019-06-20 10:52:54 -0400, Tom Lane wrote: > John Naylor writes: > > It would be nice to have confirmation to make sure I didn't err > > somewhere, and to try a more real-world benchmark. > > I don't see much wrong with using information_schema.sql as a parser/lexer > benchmark case. We should try to confirm the results on other platforms > though. Might be worth also testing with a more repetitive testcase to measure both cache locality and branch prediction. I assume that with information_schema there's enough variability that these effects play a smaller role. And there's plenty real-world cases where there's a *lot* of very similar statements being parsed over and over. I'd probably just measure the statements pgbench generates or such. Greetings, Andres Freund
Re: UCT (Re: pgsql: Update time zone data files to tzdata release 2019a.)
On Thu, Jun 20, 2019 at 8:52 AM Stephen Frost wrote: > I don't want to come across as implying that I'm saying what was done > was 'fine', or that we shouldn't be having this conversation, I'm just > trying to figure out how we can frame it in a way that we learn from it > and work to improve on it for the future, should something like this > happen again. I agree that it's a difficult situation. I do kind of wonder whether we were altogether overreacting. If we had shipped it as it was, what's the worst thing that would have happened? -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: check_recovery_target_lsn() does a PG_CATCH without a throw
Hi, On 2019-06-20 15:42:14 +0200, Peter Eisentraut wrote: > On 2019-06-12 13:16, Peter Eisentraut wrote: > > I haven't figured out the time zone issue yet, but I guess the solution > > might involve moving some of the code from check_recovery_target_time() > > to assign_recovery_target_time(). > > I think that won't work either. What we need to do is postpone the > interpretation of the timestamp string until after all the GUC > processing is done. So check_recovery_target_time() would just do some > basic parsing checks, but store the string. Then, when we need the > recovery_target_time_value, we do the final parsing. Then we can be sure > that the time zone is all set. That sounds right to me. Greetings, Andres Freund
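As a rough illustration of that "parse late" idea (a sketch with invented function names, not the committed code), the check hook can simply accept the string and leave the conversion to TimestampTz to the point of use, once the TimeZone GUC is known to be settled:

    #include "postgres.h"
    #include "fmgr.h"
    #include "utils/builtins.h"
    #include "utils/guc.h"
    #include "utils/timestamp.h"

    static bool
    check_recovery_target_time_sketch(char **newval, void **extra, GucSource source)
    {
        /*
         * Deliberately do not run timestamptz_in() here and cache the result:
         * during GUC processing the session's TimeZone may not have been
         * applied yet, which is exactly the problem discussed above.  Just
         * accept the string (an empty string means "not set").
         */
        return true;
    }

    /* Called lazily, once all GUCs (including TimeZone) are in effect. */
    static TimestampTz
    recovery_target_time_value_sketch(const char *str)
    {
        return DatumGetTimestampTz(DirectFunctionCall3(timestamptz_in,
                                                       CStringGetDatum(str),
                                                       ObjectIdGetDatum(InvalidOid),
                                                       Int32GetDatum(-1)));
    }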
Re: UCT (Re: pgsql: Update time zone data files to tzdata release 2019a.)
Hi, On 2019-06-20 12:02:30 -0400, Robert Haas wrote: > On Thu, Jun 20, 2019 at 8:52 AM Stephen Frost wrote: > > I don't want to come across as implying that I'm saying what was done > > was 'fine', or that we shouldn't be having this conversation, I'm just > > trying to figure out how we can frame it in a way that we learn from it > > and work to improve on it for the future, should something like this > > happen again. > > I agree that it's a difficult situation. I do kind of wonder whether > we were altogether overreacting. If we had shipped it as it was, > what's the worst thing that would have happened? I think it's not good, but also nothing particularly bad came out of it. I don't think we should try to set up procedures for future occurrences, and rather work/plan on that not happening very often. Greetings, Andres Freund
Re: POC: Cleaning up orphaned files using undo logs
On Thu, Jun 20, 2019 at 11:35 AM Amit Kapila wrote: > This delay is for *not* choking the system by constantly performing > undo requests that consume a lot of CPU and I/O as discussed in above > point. For holding off the same error request to be re-tried, we need > next_retry_time type of method as discussed below. Oh. That's not what I thought we were talking about. It's not unreasonable to think about trying to rate limit undo application just like we do for vacuum, but a fixed delay between requests would be a completely inadequate way of attacking that problem. If the individual requests are short, it will create too much delay, and if they are long, it will not create enough. We would need delays within a transaction, not just between transactions, similar to how the vacuum cost delay stuff works. I suggest that we leave that to one side for now. It seems like something that could be added later, maybe in a more general way, and not something that needs to be or should be closely connected to the logic for deciding the order in which we're going to process different transactions in undo. > > Meh. Don't get stuck on one particular method of calculating the next > > retry time. We want to be able to change that easily if whatever we > > try first doesn't work out well. I am not convinced that we need > > anything more complex than a fixed retry time, probably controlled by > > a GUC (undo_failure_retry_time = 10s?). > > IIRC, then you only seem to have suggested that we need a kind of > back-off algorithm that gradually increases the retry time up to some > maximum [1]. I think that is a good way to de-prioritize requests > that are repeatedly failing. Say, there is a request that has already > failed for 5 times and the worker queues it to get executed after 10s. > Immediately after that, another new request has failed for the first > time for the same database and it also got queued to get executed > after 10s. In this scheme the request that has already failed for 5 > times will get a chance before the request that has failed for the > first time. Sure, that's an advantage of increasing back-off times -- you can keep the stuff that looks hopeless from interfering too much with the stuff that is more likely to work out. However, I don't think we've actually done enough testing to know for sure what algorithm will work out best. Do we want linear back-off (10s, 20s, 30s, ...)? Exponential back-off (1s, 2s, 4s, 8s, ...)? No back-off (10s, 10s, 10s, 10s)? Some algorithm that depends on the size of the failed transaction, so that big things get retried less often? I think it's important to design the code in such a way that the algorithm can be changed easily later, because I don't think we can be confident that whatever we pick for the first attempt will prove to be best. I'm pretty sure that storing the failure count INSTEAD OF the next retry time is going to make it harder to experiment with different algorithms later. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
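Purely as an illustration of keeping the policy swappable (all names below are invented, not from the patch series): if the queue stores only a next_retry_time, the back-off rule lives in one small function and can be switched between fixed, linear, and capped-exponential without touching the request format or the launcher/worker logic:

    #include "postgres.h"
    #include "utils/timestamp.h"

    /* hypothetical GUCs, loosely modelled on the undo_failure_retry_time idea */
    static int  undo_failure_retry_time = 10;   /* base delay, seconds */
    static int  undo_failure_retry_max = 600;   /* back-off cap, seconds */

    typedef enum UndoRetryPolicy
    {
        UNDO_RETRY_FIXED,
        UNDO_RETRY_LINEAR,
        UNDO_RETRY_EXPONENTIAL
    } UndoRetryPolicy;

    /*
     * Compute the next_retry_time to store with a failed request.
     * failure_count is at least 1 when this is called.
     */
    static TimestampTz
    undo_next_retry_time(TimestampTz now, int failure_count, UndoRetryPolicy policy)
    {
        int64       delay_secs = undo_failure_retry_time;

        switch (policy)
        {
            case UNDO_RETRY_FIXED:
                break;
            case UNDO_RETRY_LINEAR:
                delay_secs = (int64) undo_failure_retry_time * failure_count;
                break;
            case UNDO_RETRY_EXPONENTIAL:
                delay_secs = (int64) undo_failure_retry_time << Min(failure_count - 1, 16);
                break;
        }
        delay_secs = Min(delay_secs, (int64) undo_failure_retry_max);

        return TimestampTzPlusMilliseconds(now, delay_secs * 1000);
    }

Whichever rule wins out, the launcher/worker only ever compares the current time against the stored next_retry_time, so experimenting with policies later stays cheap.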
Do we need to do better for pg_ctl timeouts?
Hi, Right now, using -w to shut down clusters with somewhat larger shared buffers will very frequently fail, because the shutdown checkpoint takes much longer than 60s. Obviously that can be addressed by manually setting PGCTLTIMEOUT to something higher, but forcing many users to do that doesn't seem right. And while many users probably don't want to aggressively time-out on the shutdown checkpoint, I'd assume most do want to time out aggressively if the server doesn't actually start the checkpoint. I wonder if we need to split the timeout into two: one value for the postmaster to acknowledge the action, one for that action to complete. It seems to me that that'd be useful for all of starting, restarting and stopping. I think we have all the necessary information in the pid file; we would just need to check for PM_STATUS_STARTING for start and PM_STATUS_STOPPING for restart/stop. Comments? Greetings, Andres Freund
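A rough sketch of what the two-phase wait could look like on the stop side; read_pidfile_status() is a made-up stand-in for pg_ctl's existing postmaster.pid parsing, and the real thing would need to cover start and restart too:

    #include <string.h>
    #include <unistd.h>
    #include "postgres_fe.h"
    #include "utils/pidfile.h"      /* PM_STATUS_STOPPING */

    /* hypothetical helper: returns the status word from postmaster.pid,
     * or NULL once the pid file is gone */
    extern const char *read_pidfile_status(void);

    static bool
    wait_for_stop(int ack_timeout, int completion_timeout)
    {
        int         seconds;

        /* Phase 1: the postmaster must acknowledge the shutdown quickly. */
        for (seconds = 0; seconds < ack_timeout; seconds++)
        {
            const char *status = read_pidfile_status();

            if (status == NULL || strcmp(status, PM_STATUS_STOPPING) == 0)
                break;
            sleep(1);
        }
        if (seconds >= ack_timeout)
            return false;           /* shutdown was never even started */

        /* Phase 2: the shutdown checkpoint itself may legitimately take long. */
        for (seconds = 0; seconds < completion_timeout; seconds++)
        {
            if (read_pidfile_status() == NULL)
                return true;        /* pid file removed: postmaster is gone */
            sleep(1);
        }
        return false;
    }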
commitfest application - create patch doesn't work
Hi, Searching by subject for the "Specify thread msgid" field doesn't work. It returns an empty result set every time. Regards Pavel
Re: UCT (Re: pgsql: Update time zone data files to tzdata release 2019a.)
Greetings, * Andres Freund (and...@anarazel.de) wrote: > On 2019-06-20 12:02:30 -0400, Robert Haas wrote: > > On Thu, Jun 20, 2019 at 8:52 AM Stephen Frost wrote: > > > I don't want to come across as implying that I'm saying what was done > > > was 'fine', or that we shouldn't be having this conversation, I'm just > > > trying to figure out how we can frame it in a way that we learn from it > > > and work to improve on it for the future, should something like this > > > happen again. > > > > I agree that it's a difficult situation. I do kind of wonder whether > > we were altogether overreacting. If we had shipped it as it was, > > what's the worst thing that would have happened? > > I think it's not good, but also nothing particularly bad came out of > it. I don't think we should try to set up procedures for future > occurances, and rather work/plan on that not happening very often. Agreed. Thanks, Stephen signature.asc Description: PGP signature
Re: UCT (Re: pgsql: Update time zone data files to tzdata release 2019a.)
On 2019-Jun-20, Andres Freund wrote: > On 2019-06-20 12:02:30 -0400, Robert Haas wrote: > > I agree that it's a difficult situation. I do kind of wonder whether > > we were altogether overreacting. If we had shipped it as it was, > > what's the worst thing that would have happened? > > I think it's not good, but also nothing particularly bad came out of > it. I don't think we should try to set up procedures for future > occurances, and rather work/plan on that not happening very often. I suppose we could have a moratorium on commits starting from (say) EOB Wednesday of the week prior to the release; patches can only be committed after that if they have ample support (where "ample support" might be defined as having +1 from, say, two other committers). That way there's time to discuss/revert/fix anything that is deemed controversial. -- Álvaro Herrera https://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Re: UCT (Re: pgsql: Update time zone data files to tzdata release 2019a.)
On Thu, Jun 20, 2019 at 1:28 PM Alvaro Herrera wrote: > I suppose we could have a moratorium on commits starting from (say) EOB > Wednesday of the week prior to the release; patches can only be > committed after that if they have ample support (where "ample support" > might be defined as having +1 from, say, two other committers). That > way there's time to discuss/revert/fix anything that is deemed > controversial. Or we could have a moratorium on any change at any time that has a -1 from a committer and a +1 from nobody. I mean, your idea is not bad either. I'm just saying. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: Tweaking DSM and DSA limits
On Tue, Jun 18, 2019 at 9:08 PM Thomas Munro wrote: > It's currently set to 4, but I now think that was too cautious. It > tries to avoid fragmentation by ramping up slowly (that is, memory > allocated and in some cases committed by the operating system that we > don't turn out to need), but it's pretty wasteful of slots. Perhaps > it should be set to 2? +1. I think I said at the time that I thought that was too cautious... > Perhaps also the number of slots per backend should be dynamic, so > that you have the option to increase it from the current hard-coded > value of 2 if you don't want to increase max_connections but find > yourself running out of slots (this GUC was a request from Andres but > the name was made up by me -- if someone has a better suggestion I'm > all ears). I am not convinced that we really need to GUC-ify this. How about just bumping the value up from 2 to say 5? Between the preceding change and this one we ought to buy ourselves more than 4x, and if that is not enough then we can ask whether raising max_connections is a reasonable workaround, and if that's still not enough then we can revisit this idea, or maybe come up with something better. The problem I have with a GUC here is that nobody without a PhD in PostgreSQL-ology will have any clue how to set it, and while that's good for your employment prospects and mine, it's not so great for PostgreSQL users generally. As Andres observed off-list, it would also be a good idea to allow things that are going to gobble memory like hash joins to have some input into how much memory gets allocated. Maybe preallocating the expected size of the hash is too aggressive -- estimates can be wrong, and it could be much smaller. But maybe we should allocate at least, say, 1/64th of that amount, and act as if DSA_NUM_SEGMENTS_AT_EACH_SIZE == 1 until the cumulative memory allocation is more than 25% of that amount. So if we think it's gonna be 1GB, start by allocating 16MB and double the size of each allocation thereafter until we get to at least 256MB allocated. So then we'd have 16MB + 32MB + 64MB + 128MB + 256MB + 256MB + 512MB = 7 segments instead of the 32 required currently or the 18 required with DSA_NUM_SEGMENTS_AT_EACH_SIZE == 2. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
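For anyone wanting to play with the arithmetic above, here is a small self-contained helper (plain C, not PostgreSQL code) that counts how many segments it takes to reach a target size when "per_size" segments are created at each size before doubling, assuming DSA's 1MB initial segment size. It deliberately ignores the size cap and the hash-join-informed starting size, so it reproduces the 32- and 18-segment counts quoted above but not the 7-segment variant:

    #include <stdint.h>
    #include <stdio.h>

    /* Back-of-the-envelope only: segments needed to cover "target" bytes. */
    static int
    segments_needed(uint64_t target, uint64_t initial, int per_size)
    {
        uint64_t    total = 0;
        uint64_t    segsize = initial;
        int         nsegs = 0;
        int         at_this_size = 0;

        while (total < target)
        {
            total += segsize;
            nsegs++;
            if (++at_this_size >= per_size)
            {
                segsize *= 2;
                at_this_size = 0;
            }
        }
        return nsegs;
    }

    int
    main(void)
    {
        uint64_t    MB = 1024ull * 1024;
        uint64_t    target = 1000 * MB;     /* "about 1GB" */

        printf("per_size=4: %d segments\n", segments_needed(target, 1 * MB, 4));
        printf("per_size=2: %d segments\n", segments_needed(target, 1 * MB, 2));
        return 0;
    }

Compiled standalone, this prints 32 and 18 for a ~1GB target, matching the counts mentioned above.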
Re: UCT (Re: pgsql: Update time zone data files to tzdata release 2019a.)
Greetings, * Robert Haas (robertmh...@gmail.com) wrote: > On Thu, Jun 20, 2019 at 1:28 PM Alvaro Herrera > wrote: > > I suppose we could have a moratorium on commits starting from (say) EOB > > Wednesday of the week prior to the release; patches can only be > > committed after that if they have ample support (where "ample support" > > might be defined as having +1 from, say, two other committers). That > > way there's time to discuss/revert/fix anything that is deemed > > controversial. > > Or we could have a moratorium on any change at any time that has a -1 > from a committer and a +1 from nobody. What about a change that's already been committed but another committer feels caused a regression? If that gets a -1, does it get reverted until things are sorted out, or...? In the situation that started this discussion, a change had already been made and it was only later realized that it caused a regression. Piling on to that, the regression was entwined with other important changes that we wanted to include in the release. Having a system where the timing of the commit is a driving factor seems like it would potentially reward people who pushed a change early by giving them the upper hand in such a discussion as this. Ultimately though, I still agree with Andres that these are situations we should act to avoid, and we shouldn't try to make a policy to fit what's been a very rare occurrence. If nothing else, I feel like we'd probably re-litigate the policy every time since it would likely have been a long time since the last discussion of it and the specific circumstances will always be at least somewhat different. Thanks, Stephen signature.asc Description: PGP signature
Re: commitfest application - create patch doesn't work
Greetings, * Pavel Stehule (pavel.steh...@gmail.com) wrote: > Searching subject for "Specify thread msgid" field doesn't work. It returns > empty result set every time. Is this still not working? I was chatting with Magnus and it seems possible this was broken and then fixed already. Thanks, Stephen signature.asc Description: PGP signature
Re: psql UPDATE field [tab] expands to DEFAULT?
I wrote: > Nosing around in tab-complete.c, I notice a fair number of other > places where we're doing COMPLETE_WITH() with just a single possible > completion. Knowing what we know now, in each one of those places > libreadline will suppose that that completion is the only syntactically > legal continuation, and throw away anything else the user might've typed. > We should probably inspect each of those places to see if that's really > desirable behavior ... but I didn't muster the energy to do that this > morning. I took a closer look and realized that this isn't some magic behavior of arcane parts of libreadline; it's more like self-inflicted damage. It happens because tab-complete.c's complete_from_const() is doing exactly what its comment says it does: /* * This function returns one fixed string the first time even if it doesn't * match what's there, and nothing the second time. This should be used if * there is only one possibility that can appear at a certain spot, so * misspellings will be overwritten. The string to be passed must be in * completion_charp. */ This is unlike complete_from_list(), which will only return completions that match the text-string-so-far. I have to wonder whether complete_from_const()'s behavior is really a good idea; I think there might be an argument for getting rid of it and using complete_from_list() even for one-element lists. We certainly didn't do anybody any favors in the refactoring we did in 4f3b38fe2, which removed the source-code difference between calling complete_from_const() and calling complete_from_list() with just one list item. But even before that, I really doubt that many people hacking on tab-complete.c had internalized the idea that COMPLETE_WITH_CONST() implied a higher degree of certainty than COMPLETE_WITH_LIST() with one list item. I'm pretty sure I'd never understood that. Both of those functions go back to the beginnings of tab-complete.c, so there's not much available in the history to explain the difference in behavior (and the discussion of the original patch, if any, is lost to the mists of time --- our archives for pgsql-patches only go back to 2000). But my own feeling about this is that there's no situation in which I'd expect tab completion to wipe out text I'd typed. Thoughts? regards, tom lane
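To illustrate the behavioural difference being described (a simplified sketch, not the actual psql sources): a list-style generator only offers candidates that extend what has been typed, while a const-style generator returns its word unconditionally, which is what lets readline treat it as the only legal continuation and overwrite a "misspelling":

    #include <string.h>
    #include <strings.h>

    /* list-style: only offer completions that extend the typed text */
    static char *
    complete_from_list_sketch(const char *text, const char **words, int *state)
    {
        while (words[*state] != NULL)
        {
            const char *w = words[(*state)++];

            if (strncasecmp(w, text, strlen(text)) == 0)
                return strdup(w);
        }
        return NULL;
    }

    /* const-style: returned once regardless of the typed text, so readline
     * replaces whatever the user had typed with this word */
    static char *
    complete_from_const_sketch(const char *text, const char *word, int *state)
    {
        (void) text;            /* deliberately ignored */
        return ((*state)++ == 0) ? strdup(word) : NULL;
    }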
Re: Tweaking DSM and DSA limits
Hi, On 2019-06-20 14:20:27 -0400, Robert Haas wrote: > On Tue, Jun 18, 2019 at 9:08 PM Thomas Munro wrote: > > Perhaps also the number of slots per backend should be dynamic, so > > that you have the option to increase it from the current hard-coded > > value of 2 if you don't want to increase max_connections but find > > yourself running out of slots (this GUC was a request from Andres but > > the name was made up by me -- if someone has a better suggestion I'm > > all ears). > > I am not convinced that we really need to GUC-ify this. How about > just bumping the value up from 2 to say 5? I'm not sure either. Although it's not great if the only way out for a user hitting this is to increase max_connections... But we should really increase the default. > As Andres observed off-list, it would also be a good idea to allow > things that are going to gobble memory like hash joins to have some > input into how much memory gets allocated. Maybe preallocating the > expected size of the hash is too aggressive -- estimates can be wrong, > and it could be much smaller. At least for the case of the hashtable itself, we allocate that at the predicted size immediately. So a mis-estimation wouldn't change anything. For the entries, yea, something like you suggest would make sense. Greetings, Andres Freund
Re: commitfest application - create patch doesn't work
On Thu, 20 Jun 2019 at 20:27, Stephen Frost wrote: > Greetings, > > * Pavel Stehule (pavel.steh...@gmail.com) wrote: > > Searching by subject for the "Specify thread msgid" field doesn't work. > > It returns an empty result set every time. > > Is this still not working? I was chatting with Magnus and it seems > possible this was broken and then fixed already. > It is working now. Thank you. Pavel > Thanks, > > Stephen >
Re: UCT (Re: pgsql: Update time zone data files to tzdata release 2019a.)
> "Stephen" == Stephen Frost writes: Stephen> In the situation that started this discussion, a change had Stephen> already been made and it was only later realized that it Stephen> caused a regression. Just to keep the facts straight: The regression was introduced by importing tzdb 2019a (in late April) into the previous round of point releases; the change in UTC behaviour was not mentioned in the commit and presumably didn't show up on anyone's radar until there were field complaints (which didn't reach our mailing lists until Jun 4 as far as I know). Tom's "fix" of backpatching 23bd3cec6 (which happened on Friday 14th) addressed only a subset of cases, as far as I know working only on Linux (the historical convention has always been for /etc/localtime to be a copy of a zonefile, not a symlink to one). I only decided to write (and if need be commit) my own followup fix after confirming that the bug was unfixed in a default FreeBSD install when set to UTC, and there was a good chance that a number of other less-popular platforms were affected too. Stephen> Piling on to that, the regression was entwined with other Stephen> important changes that we wanted to include in the release. I'm not sure what you're referring to here? -- Andrew (irc:RhodiumToad)
Re: UCT (Re: pgsql: Update time zone data files to tzdata release 2019a.)
Greetings, * Andrew Gierth (and...@tao11.riddles.org.uk) wrote: > > "Stephen" == Stephen Frost writes: > > Stephen> In the situation that started this discussion, a change had > Stephen> already been made and it was only later realized that it > Stephen> caused a regression. > > Just to keep the facts straight: > > The regression was introduced by importing tzdb 2019a (in late April) Ah, thanks, I had misunderstood when that was committed then. > into the previous round of point releases; the change in UTC behaviour > was not mentioned in the commit and presumably didn't show up on > anyone's radar until there were field complaints (which didn't reach our > mailing lists until Jun 4 as far as I know). Ok. > Tom's "fix" of backpatching 23bd3cec6 (which happened on Friday 14th) > addressed only a subset of cases, as far as I know working only on Linux > (the historical convention has always been for /etc/localtime to be a > copy of a zonefile, not a symlink to one). I only decided to write (and > if need be commit) my own followup fix after confirming that the bug was > unfixed in a default FreeBSD install when set to UTC, and there was a > good chance that a number of other less-popular platforms were affected > too. > > Stephen> Piling on to that, the regression was entwined with other > Stephen> important changes that we wanted to include in the release. > > I'm not sure what you're referring to here? I was referring to the fact that the regression was introduced by a, presumably important, tzdb update (2019a, as mentioned above). At least, I made the assumption that the commit of the import of 2019a had more than just the change that introduced the regression, but I'm happy to admit I'm nowhere near as close to the code here as you/Tom. Thanks, Stephen signature.asc Description: PGP signature
Re: UCT (Re: pgsql: Update time zone data files to tzdata release 2019a.)
Stephen Frost writes: > * Andrew Gierth (and...@tao11.riddles.org.uk) wrote: > "Stephen" == Stephen Frost writes: >> Stephen> Piling on to that, the regression was entwined with other >> Stephen> important changes that we wanted to include in the release. >> >> I'm not sure what you're referring to here? I was confused by that too. > I was referring to the fact that the regression was introduced by a, > presumably important, tzdb update (2019a, as mentioned above). At > least, I made the assumption that the commit of the import of 2019a had > more than just the change that introduced the regression, but I'm happy > to admit I'm no where near as close to the code here as you/Tom here. Keep in mind that dealing with whatever tzdb chooses to ship is not optional from our standpoint. Even if we'd refused to import 2019a, every installation using --with-system-tzdata (which, I sincerely hope, includes most production installs) is going to have to deal with it as soon as the respective platform vendor gets around to shipping the tzdata update. So reverting that commit was never on the table. regards, tom lane
Re: UCT (Re: pgsql: Update time zone data files to tzdata release 2019a.)
> "Tom" == Tom Lane writes: >> I was referring to the fact that the regression was introduced by a, >> presumably important, tzdb update (2019a, as mentioned above). At >> least, I made the assumption that the commit of the import of 2019a >> had more than just the change that introduced the regression, but >> I'm happy to admit I'm no where near as close to the code here as >> you/Tom here. Tom> Keep in mind that dealing with whatever tzdb chooses to ship is Tom> not optional from our standpoint. Even if we'd refused to import Tom> 2019a, every installation using --with-system-tzdata (which, I Tom> sincerely hope, includes most production installs) is going to Tom> have to deal with it as soon as the respective platform vendor Tom> gets around to shipping the tzdata update. So reverting that Tom> commit was never on the table. Exactly. But that means that if the combination of our arbitrary rules and the data in the tzdb results in an undesirable result, then we have no real option but to fix our rules (we can't reasonably expect the tzdb upstream to choose zone names to make our alphabetical-order preference come out right). My commit was intended to be the minimum fix that would restore the pre-2019a behavior on all systems. -- Andrew (irc:RhodiumToad)
Re: Tweaking DSM and DSA limits
On Thu, Jun 20, 2019 at 02:20:27PM -0400, Robert Haas wrote: > On Tue, Jun 18, 2019 at 9:08 PM Thomas Munro wrote: > > It's currently set to 4, but I now think that was too cautious. It > > tries to avoid fragmentation by ramping up slowly (that is, memory > > allocated and in some cases committed by the operating system that we > > don't turn out to need), but it's pretty wasteful of slots. Perhaps > > it should be set to 2? > > +1. I think I said at the time that I thought that was too cautious... > > > Perhaps also the number of slots per backend should be dynamic, so > > that you have the option to increase it from the current hard-coded > > value of 2 if you don't want to increase max_connections but find > > yourself running out of slots (this GUC was a request from Andres but > > the name was made up by me -- if someone has a better suggestion I'm > > all ears). > > I am not convinced that we really need to GUC-ify this. How about > just bumping the value up from 2 to say 5? Between the preceding > change and this one we ought to buy ourselves more than 4x, and if > that is not enough then we can ask whether raising max_connections is > a reasonable workaround, Is there perhaps a way to make raising max_connections not require a restart? There are plenty of situations out there where restarts aren't something that can be done on a whim. Best, David. -- David Fetter http://fetter.org/ Phone: +1 415 235 3778 Remember to vote! Consider donating to Postgres: http://www.postgresql.org/about/donate
Re: UCT (Re: pgsql: Update time zone data files to tzdata release 2019a.)
Andrew Gierth writes: > "Tom" == Tom Lane writes: > Tom> Keep in mind that dealing with whatever tzdb chooses to ship is > Tom> not optional from our standpoint. Even if we'd refused to import > Tom> 2019a, every installation using --with-system-tzdata (which, I > Tom> sincerely hope, includes most production installs) is going to > Tom> have to deal with it as soon as the respective platform vendor > Tom> gets around to shipping the tzdata update. So reverting that > Tom> commit was never on the table. > Exactly. But that means that if the combination of our arbitrary rules > and the data in the tzdb results in an undesirable result, then we have > no real option but to fix our rules (we can't reasonably expect the tzdb > upstream to choose zone names to make our alphabetical-order preference > come out right). My position is basically that having TimeZone come out as 'UCT' rather than 'UTC' (affecting no visible behavior of the timestamp types, AFAIK) was not such a grave problem as to require violating community norms to get it fixed in this week's releases rather than the next batch. I hadn't had time to consider your patch last week because I was (a) busy with release prep and (b) sick as a dog. I figured we could let it slide and discuss it after the release work died down. I imagine the reason you got zero other responses was that nobody else thought it was of life-and-death urgency either. Anyway, as I said already, my beef is not with the substance of the patch but with failing to follow community process. One "yes" vote and one "no" vote do not constitute consensus. You had no business assuming that I would reverse the "no" vote. regards, tom lane
Re: UCT (Re: pgsql: Update time zone data files to tzdata release 2019a.)
Andrew Gierth writes: > Tom's "fix" of backpatching 23bd3cec6 (which happened on Friday 14th) > addressed only a subset of cases, as far as I know working only on Linux > (the historical convention has always been for /etc/localtime to be a > copy of a zonefile, not a symlink to one). I only decided to write (and > if need be commit) my own followup fix after confirming that the bug was > unfixed in a default FreeBSD install when set to UTC, and there was a > good chance that a number of other less-popular platforms were affected > too. I think your info is out of date on that. NetBSD uses a symlink, and has done for at least 5 years: see set_timezone in http://cvsweb.netbsd.org/bsdweb.cgi/src/usr.sbin/sysinst/util.c?only_with_tag=MAIN macOS seems to have done it like that for at least 10 years, too. I didn't bother digging into their source repo, as it's likely that System Preferences isn't open-source; but *all* of my macOS machines have symlinks there, and some of those link files are > 10 years old. I could not easily find OpenBSD's logic to set the zone during install, if they have any; but at least their admin-facing documentation says to create the file as a symlink: https://www.openbsd.org/faq/faq8.html#TimeZone and there are plenty of similar recommendations found by Mr. Google. In short, I think FreeBSD are holdouts not the norm. I note that even their code will preserve /etc/localtime's symlink status if it was a symlink to start with: see install_zoneinfo_file in https://github.com/freebsd/freebsd/blob/master/usr.sbin/tzsetup/tzsetup.c regards, tom lane
Re: Multivariate MCV list vs. statistics target
On Thu, Jun 20, 2019 at 08:08:44AM +0100, Dean Rasheed wrote: On Tue, 18 Jun 2019 at 22:34, Tomas Vondra wrote: One slightly inconvenient thing I realized while playing with the address data set is that it's somewhat difficult to set the desired size of the multi-column MCV list. At the moment, we simply use the maximum statistic target for attributes the MCV list is built on. But that does not allow keeping default size for per-column stats, and only increase size of multi-column MCV lists. So I'm thinking we should allow tweaking the statistics for extended stats, and serialize it in the pg_statistic_ext catalog. Any opinions why that would be a bad idea? Seems reasonable to me. This might not be the only option we'll ever want to add though, so perhaps a "stxoptions text[]" column along the lines of a relation's reloptions would be the way to go. I don't know - I kinda dislike the idea of stashing stuff like this into text[] arrays unless there's a clear need for such flexibility (i.e. vision to have more such options). Which I'm not sure is the case here. And we kinda have a precedent in pg_attribute.attstattarget, so I'd use the same approach here. I suppose it should be part of the CREATE STATISTICS command, but I'm not sure what'd be the best syntax. We might also have something more similar to ALTER COLUMNT, but perhaps ALTER STATISTICS s SET STATISTICS 1000; looks a bit too weird. Yes it does look a bit weird, but that's the natural generalisation of what we have for per-column statistics, so it's probably preferable to do that rather than invent some other syntax that wouldn't be so consistent. Yeah, I agree. regards -- Tomas Vondra http://www.2ndQuadrant.com PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Re: Choosing values for multivariate MCV lists
On Thu, Jun 20, 2019 at 06:55:41AM +0100, Dean Rasheed wrote:
> On Tue, 18 Jun 2019 at 21:59, Tomas Vondra wrote:
>> The current implementation of multi-column MCV lists (added in this cycle) uses a fairly simple algorithm to pick combinations to include in the MCV list. We just compute a minimum number of occurrences, and then include all entries sampled more often. See get_mincount_for_mcv_list().
>>
>> [snip]
>>
>> This however means that by choosing the MCV entries solely based on the number of occurrences in the sample, we end up with MCV lists where the vast majority of entries has a NULL street name. That's why we got such a poor estimate in the example query, despite the fact that the city/street combination is the most frequent in the data set (with non-NULL street name).
>
> I think the fact that they're NULL is a bit of a red herring because we're treating NULL just like any other value. The same thing would happen if there were some other very common non-NULL value that dominated the dataset.

I wasn't really suggesting the NULL is an issue - sorry if that wasn't clear. It might be any other value, as long as it's very common (and roughly uniform) in all cities. So yes, I agree with you here.

>> The other weird thing is that frequency of NULL street names is fairly uniform in the whole data set. In total about 50% addresses match that, and for individual cities it's generally between 25% and 100%, so the estimate is less than 2x off in those cases. But for addresses with non-NULL street names, the situation is very different. Some street names are unique to a single city, etc.
>>
>> Overall, this means we end up with an MCV list with entries representing the mostly-uniform part of the data set, instead of preferring the entries that are truly skewed.
>>
>> So I'm wondering if we could/should rethink the algorithm, to look at the frequency and base_frequency, and maybe pick entries based on their ratio, or something like that.
>
> Hmm, interesting. I think I would like to see a more rigorous justification for changing the algorithm deciding which values to keep.

Sure, I'm not going to pretend my proposals were particularly rigorous, it was more a collection of random ideas.

> If I've understood correctly, I think the problem is this: The mincount calculation is a good way of identifying MCV candidates to keep, because it ensures that we don't keep values that don't appear sufficiently often to produce accurate estimates, and ideally we'd keep everything with count >= mincount. However, in the case where there are more than stats_target items with count >= mincount, simply ordering by count and keeping the most commonly seen values isn't necessarily the best strategy in the case of multivariate statistics.

Yes.

> To identify what the best strategy might be, I think you need to examine the errors that would occur as a result of *not* keeping a value in the multivariate MCV list. Given a value that appears with count >= mincount, N*freq ought to be a reasonable estimate for the actual number of occurrences of that value in the table, and N*base_freq ought to be a reasonable estimate for the univariate-stats-based estimate that it would be given if it weren't kept in the multivariate MCV list. So the absolute error resulting from not keeping that value would be
>
> N * Abs(freq - base_freq)

Yeah. Considering N is the same for all groups in the sample, this would mean the same thing as Abs(freq - base_freq).

> But then I think we ought to take into account how often we're likely to get that error. If we're simply picking values at random, the likelihood of getting that value is just its frequency, so the average absolute error would be
>
> Sum( N * freq[i] * Abs(freq[i] - base_freq[i]) )
>
> which suggests that, if we wanted to reduce the average absolute error of the estimates, we should order by freq*Abs(freq-base_freq) and keep the top n in the MCV list.

Interesting idea. But I'm not sure it makes much sense to assume the rows are picked randomly - OTOH we don't really know anything about how the data will be queried, so we may just as well do that.

> On the other hand, if we wanted to reduce the average *relative* error of the estimates, we might instead order by Abs(freq-base_freq).

Hmmm, yeah. I don't know what's the right choice here, TBH.

>> For example, we might sort the entries by
>>
>> abs(freq - base_freq) / freq
>
> I'm not sure it's easy to justify ordering by Abs(freq-base_freq)/freq though, because that would seem likely to put too much weight on the least commonly occurring values.

But would that be an issue, or a good thing? I mean, as long as the item is above mincount, we take the counts as reliable. As I explained, my motivation for proposing that was that both

... (cost=... rows=1 ...) (actual=... rows=101 ...)

and

... (cost=... rows=100 ...) (actual=... rows=200 ...)

have exactly the same Abs(freq - base_freq), but I think we both agree that the first misestimate is much more serious.
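To make the competing orderings easy to compare side by side, here is a toy scoring function (illustration only, not the extended-statistics code); candidates above mincount would be sorted by descending score and the first statistics-target entries kept:

    #include <math.h>

    typedef struct McvCandidate
    {
        double      freq;       /* frequency observed in the sample */
        double      base_freq;  /* product of per-column frequencies */
    } McvCandidate;

    typedef enum McvOrdering
    {
        MCV_MIN_ABS_ERROR,      /* freq * |freq - base_freq|, per the random-pick argument */
        MCV_MIN_REL_ERROR,      /* |freq - base_freq| */
        MCV_ERROR_TO_FREQ_RATIO /* |freq - base_freq| / freq, the original proposal */
    } McvOrdering;

    static double
    mcv_score(const McvCandidate *c, McvOrdering ordering)
    {
        double      err = fabs(c->freq - c->base_freq);

        switch (ordering)
        {
            case MCV_MIN_ABS_ERROR:
                return c->freq * err;
            case MCV_MIN_REL_ERROR:
                return err;
            case MCV_ERROR_TO_FREQ_RATIO:
            default:
                return err / c->freq;
        }
    }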
Re: clean up docs for v12
On 2019-May-20, Paul A Jungwirth wrote: > On Mon, May 20, 2019 at 9:44 PM Amit Langote > wrote: > > This sounds more like an open item to me [1], not something that has to > > be postponed until the next CF. > > > > [1] https://wiki.postgresql.org/wiki/PostgreSQL_12_Open_Items > > Oh sorry, I already created the CF entry. Should I withdraw it? I'll > ask on -infra about getting editor permission for the wiki and add a > note there instead. You didn't actually ask, but I did it anyway. -- Álvaro Herrera https://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Re: clean up docs for v12
This patch was applied as f73293aba4d4. Thanks, Paul and Michael. -- Álvaro Herrera https://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Re: unlogged sequences
On Thu, Jun 20, 2019 at 09:30:34AM +0200, Peter Eisentraut wrote: > The discussion in bug #15631 revealed that serial/identity sequences of > temporary tables should really also be temporary (easy), and that > serial/identity sequences of unlogged tables should also be unlogged. > But there is no support for unlogged sequences, so I looked into that. Thanks for doing so. > If you copy the initial sequence relation file to the init fork, then > this all seems to work out just fine. Attached is a patch. The > low-level copying seems to be handled quite inconsistently across the > code, so I'm not sure what the most appropriate way to do this would be. > I'm looking for feedback from those who have worked on tableam and > storage manager to see what the right interfaces are or whether some new > interfaces might perhaps be appropriate. But the actual deal is that the sequence meta-data is now in pg_sequences and not the init forks, no? I have just done a small test:
1) Some SQL queries:
create unlogged sequence popo;
alter sequence popo increment 2;
select nextval('popo');
select nextval('popo');
2) Then a hard crash:
pg_ctl stop -m immediate
pg_ctl start
3) Again, with a crash:
select nextval('popo');
#2 0x55ce60f3208d in ExceptionalCondition (conditionName=0x55ce610f0570 "!(((PageHeader) (page))->pd_special >= (__builtin_offsetof (PageHeaderData, pd_linp)))", errorType=0x55ce610f0507 "FailedAssertion", fileName=0x55ce610f04e0 "../../../src/include/storage/bufpage.h", lineNumber=317) at assert.c:54
#3 0x55ce60b43200 in PageValidateSpecialPointer (page=0x7ff7692b3d80 "") at ../../../src/include/storage/bufpage.h:317
#4 0x55ce60b459d4 in read_seq_tuple (rel=0x7ff768ad27e0, buf=0x7ffc5707f0bc, seqdatatuple=0x7ffc5707f0a0) at sequence.c:1213
#5 0x55ce60b447ee in nextval_internal (relid=16385, check_permissions=true) at sequence.c:678
#6 0x55ce60b44533 in nextval_oid (fcinfo=0x55ce62537570) at sequence.c:607
-- Michael signature.asc Description: PGP signature
Re: clean up docs for v12
On Thu, Jun 20, 2019 at 07:34:10PM -0400, Alvaro Herrera wrote: > This patch was applied as f73293aba4d4. Thanks, Paul and Michael. Thanks for the thread update, Alvaro. I completely forgot to mention the commit on this thread. -- Michael signature.asc Description: PGP signature
Re: pg_dump partitions can lead to inconsistent state after restore
On Thu, Apr 25, 2019 at 4:01 AM Alvaro Herrera wrote: > So, while testing this I noticed that pg_restore fails with deadlocks if > you do a parallel restore if the --load-via-partition-root switch was > given to pg_dump. Is that a known bug? Was investigating --load-via-partition-root with a coworker and came across the following note in the documentation: https://www.postgresql.org/docs/11/app-pgdump.html "It is best not to use parallelism when restoring from an archive made with this option, because pg_restore will not know exactly which partition(s) a given archive data item will load data into. This could result in inefficiency due to lock conflicts between parallel jobs, or perhaps even reload failures due to foreign key constraints being set up before all the relevant data is loaded." Apparently, this note was added as a result of the following discussion: https://www.postgresql.org/message-id/flat/13624.1535486019%40sss.pgh.pa.us So, while the documentation doesn't explicitly list deadlocks as a possible risk, Tom hinted in the first email that it's possible. I set out to reproduce one and was able to, although I'm not sure if it's the same deadlock as seen by Alvaro. Steps I used to reproduce: # in the source database create table foo (a int primary key); insert into foo select generate_series(1, 100); create table ht (a int) partition by hash (a); select 'create table ht' || i || ' partition of ht for values with (modulus 100, remainder ' || i -1 || ');' from generate_series(1, 100) i; \gexec insert into ht select generate_series(1, 100); alter table ht add foreign key (a) references foo (a); # in shell pg_dump --load-via-partition-root -Fd -f /tmp/dump createdb targetdb pg_restore -d targetdb -j 2 /tmp/dump The last step reports deadlocks; in the server log: ERROR: deadlock detected DETAIL: Process 14213 waits for RowExclusiveLock on relation 17447 of database 17443; blocked by process 14212. Process 14212 waits for ShareRowExclusiveLock on relation 17507 of database 17443; blocked by process 14213. Process 14213: COPY public.ht (a) FROM stdin; Process 14212: ALTER TABLE public.ht ADD CONSTRAINT ht_a_fkey FOREIGN KEY (a) REFERENCES public.foo(a); Here, the process adding the foreign key has got the lock on the parent and is trying to lock a partition to add the foreign key to it. The process doing COPY (via root) has apparently locked the partition and is waiting for the lock on the parent to do the actual copying. Looking into why the latter had got a lock on the partition at all if it hadn't started the copying yet, I noticed that it was locked when TRUNCATE was executed on it earlier in the same transaction as part of some WAL-related optimization, which is something that only happens in the parallel restore mode. I was under the impression that the TABLE DATA archive item (its TocEntry) would have no trace of the partition if it was dumped with --load-via-partition-root, but that's not the case. --load-via-partition-root only dictates that the command that will be dumped for the item will use the root parent as COPY target, even though the TocEntry itself is owned by the partition. Maybe, a way to prevent the deadlock would be for the process that will do copy-into-given-partition-via-root-parent to do a `LOCK TABLE root_parent` before `TRUNCATE the_partition`, but we'll need to get hold of the root parent from the TocEntry somehow. Turns out it's only present in the TocEntry.copyStmt, from which it will have to be parsed out. 
Maybe that's the only thing we could do without breaking the archive format though. Thoughts? By the way, I couldn't think of ways to reproduce any of the hazards mentioned in the documentation of using parallel mode to restore an archive written with pg_dump --load-via-partition-root, but maybe I just haven't tried hard enough. Thanks, Amit
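For what it's worth, the idea above would boil down to something like the following on the worker side. This is a hedged sketch only: run_sql() is a placeholder for however the restore worker actually issues commands, the lock mode is an assumption (ROW EXCLUSIVE being what the COPY through the root would request anyway), and extracting the root name from copyStmt is hand-waved:

    #include <stdio.h>
    #include "libpq-fe.h"

    /* stand-in for the worker's real command-issuing path */
    extern void run_sql(PGconn *conn, const char *sql);

    static void
    load_partition_via_root(PGconn *conn, const char *root, const char *partition)
    {
        char        buf[1024];

        run_sql(conn, "BEGIN");

        /*
         * Lock the root parent before touching the partition, so this worker
         * and a concurrent "ALTER TABLE root ADD FOREIGN KEY ..." job acquire
         * locks in the same parent-then-partition order instead of deadlocking.
         */
        snprintf(buf, sizeof(buf), "LOCK TABLE %s IN ROW EXCLUSIVE MODE", root);
        run_sql(conn, buf);

        /* the existing WAL-skip optimization for parallel restore */
        snprintf(buf, sizeof(buf), "TRUNCATE TABLE %s", partition);
        run_sql(conn, buf);

        /* ... then COPY root FROM stdin (rows route back to the partition), COMMIT ... */
    }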