Re: pglz performance
On Fri, Aug 02, 2019 at 07:52:39PM +0200, Tomas Vondra wrote:
> On Fri, Aug 02, 2019 at 10:12:58AM -0700, Andres Freund wrote:
>> Why would they be stuck continuing to *compress* with pglz? As we
>> fully retoast on write anyway we can just gradually switch over to the
>> better algorithm. Decompression speed is another story, of course.
>
> Hmmm, I don't remember the details of those patches so I didn't realize
> it allows incremental recompression. If that's possible, that would mean
> existing systems can start using it. Which is good.

It may become a problem on some platforms though (Windows?), so patches to improve either the compression or decompression of pglz are not that crazy, as pglz is still likely going to be used, and for read-mostly workloads switching to a new algorithm may not be worth the extra cost, so it is not as if we are going to drop it completely either. My take, still the same as upthread, is that it mostly depends on the amount of complexity each patch introduces compared to the performance gain.

> Another question is whether we'd actually want to include the code in
> core directly, or use system libraries (and if some packagers might
> decide to disable that, for whatever reason).

Linking to system libraries would make our maintenance much easier; when it comes to carrying a copy of something else in the tree, we would be stuck with more maintenance around it, and such copies tend to rot easily. Then there is the case where the compression algorithm is available in the binary on one server but not on another, in which case we get an automatic ERROR for a mismatching algorithm, or a FATAL when decompressing FPWs at recovery if wal_compression is used.
--
Michael

signature.asc
Description: PGP signature
Re: Ought to use heap_multi_insert() for pg_attribute/depend insertions?
> On 2 Aug 2019, at 00:41, Thomas Munro wrote:
>
> On Tue, Jul 9, 2019 at 11:07 PM Daniel Gustafsson wrote:
>> The attached v3 also has that fix in order to see if the cfbot is happier
>> with this.
>
> Noticed while moving this to the next CF:
>
> heap.c: In function ‘StorePartitionKey’:
> heap.c:3582:3: error: ‘referenced’ undeclared (first use in this function)
>   referenced.classId = RelationRelationId;
>   ^

Thanks, the attached v4 updates the patch to handle a0555ddab9b672a046 as well.

cheers ./daniel

catalog_multi_insert-v4.patch
Description: Binary data
Re: pglz performance
Hi,

On 2019-08-05 00:19:28 +0200, Petr Jelinek wrote:
> It carries that information inside the compressed value, like I said in the
> other reply, that's why the extra byte.

I'm not convinced that that is a good plan - imo the reference to the compressed data should carry that information. I.e. in the case of toast, at least toast pointers should hold enough information to determine the compression algorithm. And in the case of WAL, the WAL record should contain that.

Consider e.g. adding support for slice fetching of datums compressed with some algorithm - we should be able to determine whether that's possible without fetching the datum (so we can take a non-exceptional path for datums compressed otherwise). Similarly, for WAL, we should be able to detect whether an incompatible compression format is used without having to invoke a generic compression routine that then fails in some way. Or adding compression reporting for WAL to xlogdump.

I also don't particularly like baking in the assumption that we don't support tuples larger than 1GB in further places. To me it seems likely that we're going to have to fix that, and it's hard enough already... I know that my patch did that too...

For external datums I suggest encoding the compression method as a distinct VARTAG_ONDISK_COMPRESSED, and then have that include the compression method as a field.

For in-line compressed values (so VARATT_IS_4B_C), doing something roughly like you did, indicating the type of metadata following using the high bit, sounds reasonable. But I think I'd make it so that if the high bit is set, the struct is instead entirely different, keeping a full 4-byte length, and including the compression type header inside the struct. Perhaps I'd combine the compression type with the high-bit-set part? So when the high bit is set, it'd be something like

{
    int32   vl_len_;        /* varlena header (do not touch directly!) */

    /*
     * Actually only 7 bits, the high bit determines whether this
     * is the old compression header (if unset), or this type of header
     * (if set).
     */
    uint8   type;

    /*
     * Stored as uint8[4], to avoid unnecessary alignment padding.
     */
    uint8   length[4];

    char    va_data[FLEXIBLE_ARRAY_MEMBER];
}

I think it's worth spending some effort trying to get this right - we'll be bound by design choices for a while.

It's kinda annoying that toast datums aren't better designed. Creating them from scratch, I'd make it something like:

1) variable-width integer describing the "physical length", so that tuple deforming can quickly determine length - all the ifs necessary to determine lengths are a bottleneck. I'd probably just use a 7-bit-per-byte encoding: if the high bit is set, there's a following length byte.

2) type of toasted datum, probably also in a variable-width encoding, starting at 1 byte. Not that I think it's likely we'd overrun 256 types - but it's cheap enough to just declare the high bit as a length extension bit.

These are always stored unaligned. So there's no need to deal with padding bytes having to be zero to determine whether we're dealing with a 1-byte datum etc.

Then, type-dependent:

For in-line uncompressed datums:
3a) alignment padding, amount determined by 2) above, i.e. we'd just have different types for different amounts of alignment. Probably using some heuristic to use unaligned when either dealing with data that doesn't need alignment, or when the datum is fairly small, so copying to get the data as unaligned won't be a significant penalty.
4a) data

For in-line compressed datums:
3b) compression metadata {varint rawsize, varint compression algorithm}
4b) unaligned compressed data - there's no benefit in keeping it aligned

For external toast for uncompressed data:
3d) {toastrelid, valueid, varint rawsize}

For external toast for compressed data:
3e) {valueid, toastrelid, varint compression_algorithm, varint rawsize, varint extsize}

That'd make it a lot more extensible, easier to understand, faster to decode in a lot of cases, and remove a lot of arbitrary limits. Yes, it'd increase the header size for small datums to two bytes, but I think that'd be more than bought back by the other improvements.

Greetings,

Andres Freund
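For illustration, a minimal sketch of the variable-width length encoding described in 1) above, assuming a 7-bit-per-byte scheme where a set high bit means "another length byte follows" (the function names are illustrative, not from any patch):

    #include <stdint.h>

    /* Encode "value" using 7 data bits per byte; returns number of bytes written. */
    static int
    encode_varlen(uint64_t value, unsigned char *buf)
    {
        int     n = 0;

        do
        {
            unsigned char b = value & 0x7F;

            value >>= 7;
            if (value != 0)
                b |= 0x80;          /* continuation: another length byte follows */
            buf[n++] = b;
        } while (value != 0);

        return n;
    }

    /* Decode a length produced by encode_varlen(); *consumed gets the byte count. */
    static uint64_t
    decode_varlen(const unsigned char *buf, int *consumed)
    {
        uint64_t        value = 0;
        int             shift = 0;
        int             n = 0;
        unsigned char   b;

        do
        {
            b = buf[n++];
            value |= (uint64_t) (b & 0x7F) << shift;
            shift += 7;
        } while (b & 0x80);

        *consumed = n;
        return value;
    }

With this scheme, lengths up to 127 need only a single header byte, and each additional byte extends the range by 7 bits.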
Re: Feature improvement: can we add queryId for pg_catalog.pg_stat_activity view?
At Sun, 4 Aug 2019 00:04:01 -0700 (MST), legrand legrand wrote in <1564902241482-0.p...@n3.nabble.com>
> > However having the nested queryid in
> > pg_stat_activity would be convenient to track
> > what a long stored function is currently doing.
>
> +1
>
> And this could permit to get wait event sampling per queryid when
> pg_stat_statements.track = all

I'm strongly on this side emotionally, but I'm also on Tom and Andres's side that exposing queryid that way is not the right thing. Doing that means we don't need exact correspondence between the top-level query and queryId (in nested or multistatement queries) in this patch. pg_stat_statements will allow us to do the same thing by having an additional uint64[MaxBackends] array in pgssSharedState, instead of expanding the PgBackendStatus array in core by the same size.

regards.

--
Kyotaro Horiguchi
NTT Open Source Software Center
SSL Connection still showing TLSv1.3 even though it is disabled in ssl_ciphers
Hi,

While testing with OpenSSL 1.1.1c, I only enabled TLSv1.2 and the rest, including TLSv1.3, have been disabled, like this -

postgres=# show ssl_ciphers ;
                 ssl_ciphers
----------------------------------------------
 TLSv1.2:!aNULL:!SSLv2:!SSLv3:!TLSv1:!TLSv1.3

To confirm the same, there is a tool called sslyze (SSLyze is a Python library and a CLI tool that can analyze the SSL configuration of a server by connecting to it, https://github.com/nabla-c0d3/sslyze) which I configured on my machine. Run this command -

[root@localhost Downloads]# python -m sslyze --sslv2 --sslv3 --tlsv1 --tlsv1_1 --tlsv1_2 --tlsv1_3 localhost:5432 --starttls=postgres --hide_rejected_ciphers

AVAILABLE PLUGINS
- CompressionPlugin HttpHeadersPlugin OpenSslCcsInjectionPlugin OpenSslCipherSuitesPlugin SessionResumptionPlugin FallbackScsvPlugin CertificateInfoPlugin RobotPlugin HeartbleedPlugin SessionRenegotiationPlugin

CHECKING HOST(S) AVAILABILITY
- localhost:5432 => 127.0.0.1

SCAN RESULTS FOR LOCALHOST:5432 - 127.0.0.1
-------------------------------------------

 * SSLV2 Cipher Suites:
     Server rejected all cipher suites.

 * TLSV1_3 Cipher Suites:
     Server rejected all cipher suites.

 * SSLV3 Cipher Suites:
     Server rejected all cipher suites.

 * TLSV1_1 Cipher Suites:
     Server rejected all cipher suites.

 * TLSV1_2 Cipher Suites:
     Forward Secrecy    OK - Supported
     RC4                OK - Not Supported

     Preferred:
       TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384       ECDH-256 bits   256 bits

     Accepted:
       TLS_DHE_RSA_WITH_AES_256_CBC_SHA256         DH-2048 bits    256 bits
       RSA_WITH_AES_256_CCM_8                      -               256 bits
       TLS_ECDHE_RSA_WITH_CHACHA20_POLY1305_SHA256 -               256 bits
       TLS_DHE_RSA_WITH_CHACHA20_POLY1305_SHA256   -               256 bits
       TLS_RSA_WITH_CAMELLIA_256_CBC_SHA256        -               256 bits
       RSA_WITH_AES_256_CCM                        -               256 bits
       TLS_DHE_RSA_WITH_CAMELLIA_256_CBC_SHA256    -               256 bits
       ARIA256-GCM-SHA384                          -               256 bits
       TLS_RSA_WITH_AES_256_CBC_SHA256             -               256 bits
       TLS_ECDHE_RSA_WITH_CAMELLIA_256_CBC_SHA384  -               256 bits
       TLS_ECDHE_RSA_WITH_AES_256_CBC_SHA384       ECDH-256 bits   256 bits
       DHE_RSA_WITH_AES_256_CCM_8                  -               256 bits
       ECDHE-ARIA256-GCM-SHA384                    -               256 bits
       TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384       ECDH-256 bits   256 bits
       TLS_DHE_RSA_WITH_AES_256_GCM_SHA384         DH-2048 bits    256 bits
       TLS_RSA_WITH_AES_256_GCM_SHA384             -               256 bits
       TLS_DHE_RSA_WITH_AES_256_CCM                -               256 bits
       DHE-RSA-ARIA256-GCM-SHA384                  -               256 bits
       TLS_RSA_WITH_CAMELLIA_128_CBC_SHA256        -               128 bits
       RSA_WITH_AES_128_CCM_8                      -               128 bits
       RSA_WITH_AES_128_CCM                        -               128 bits
       DHE_RSA_WITH_AES_128_CCM                    -               128 bits
       DHE_RSA_WITH_AES_128_CCM_8                  -               128 bits
       ARIA128-GCM-SHA256                          -               128 bits
       TLS_ECDHE_RSA_WITH_CAMELLIA_128_CBC_SHA256  -               128 bits
       TLS_DHE_RSA_WITH_AES_128_CBC_SHA256         DH-2048 bits    128 bits
       TLS_ECDHE_RSA_WITH_AES_128_GCM_SHA256       ECDH-256 bits   128 bits
       TLS_RSA_WITH_AES_128_CBC_SHA256             -               128 bits
       ECDHE-ARIA128-GCM-SHA256                    -               128 bits
       TLS_DHE_RSA_WITH_AES_128_GCM_SHA256         DH-2048 bits    128 bits
       TLS_RSA_WITH_AES_128_GCM_SHA256             -               128 bits
       TLS_DHE_RSA_WITH_CAMELLIA_128_CBC_SHA256    -               128 bits
       DHE-RSA-ARIA128-GCM-SHA256                  -               128 bits
       TLS_ECDHE_RSA_WITH_AES_128_CBC_SHA256       ECDH-256 bits   128 bits

 * TLSV1 Cipher Suites:
     Server rejected all cipher suites.

 SCAN COMPLETED IN 0.84 S

These are the ones which got rejected for TLSV1_3:

 * TLSV1_3 Cipher Suites:
     Rejected:
       TLS_CHACHA20_POLY1305_SHA256    TLS / Alert: protocol version
       TLS_AES_256_GCM_SHA384          TLS / Alert: protocol version
       TLS_AES_128_GCM_SHA256          TLS / Alert: protocol version
       TLS_AES_128_CCM_SHA256          TLS / Alert: protocol version
       TLS_AES_128_CCM_8_SHA256        TLS / Alert: protocol version

When I connect to the psql terminal -

psql.bin (10.9)
SSL connection (protocol: TLSv1.3, cipher: TLS_AES_256_GCM_SHA384, bits: 256, compression: off)
Type "help" for help.

postgres=# show ssl_ciphers ;
                 ssl_ciphers
----------------------------------------------
 TLSv1.2:!aNULL:!SSLv2:!SSLv3:!TLSv1:!TLSv1.3
(1 row)
Re: More issues with pg_verify_checksums and checksum verification in base backups
On Sat, Aug 03, 2019 at 06:47:48PM +0200, Michael Banck wrote:
> first off, a bit of a meta-question: Did the whitelist approach die
> completely, or are we going to tackle it again for v13 or later?

At this stage, it is buried. Amen.

> This is something I still have in the test suite of my pg_checksums
> fork, cause I never reverted that one from isRelFile() back to
> skipfile() (so it doesn't fail on the above cause `123.' is not
> considered a relation file worth checksumming).

We could actually fix this one. It is confusing to have pg_checksums generate a report about a segment number which is actually incorrect.

> Independently of the whitelist/blacklist question, I believe
> pg_checksums should not error out as soon as it encounters a weird-looking
> file, but either (i) still checksum it or (ii) skip it? Or is that to be
> considered a pilot error and it's fine for pg_checksums to fold?

That's actually the distinction between the blacklist and the whitelist, which would have handled that.
--
Michael

signature.asc
Description: PGP signature
Re: Ought to use heap_multi_insert() for pg_attribute/depend insertions?
On Mon, Aug 05, 2019 at 09:20:07AM +0200, Daniel Gustafsson wrote:
> Thanks, the attached v4 updates the patch to handle a0555ddab9b672a046 as well.

+   if (referenced->numrefs == 1)
+       recordDependencyOn(depender, &referenced->refs[0], behavior);
+   else
+       recordMultipleDependencies(depender,
+                                  referenced->refs, referenced->numrefs,
+                                  behavior);

This makes me wonder if we should not just add a shortcut in recordMultipleDependencies() to use recordDependencyOn() if there is only one reference in the set. That would save the effort of a multi-insert for all callers of recordMultipleDependencies() this way, including future ones. And that could also be done independently of the addition of InsertPgAttributeTuples(), no?
--
Michael

signature.asc
Description: PGP signature
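A minimal sketch of the kind of shortcut suggested here, assuming the patched recordMultipleDependencies() performs the multi-insert internally; record_single_dependency() is a hypothetical stand-in for the plain single-row insertion path (used here to avoid recursing back through recordDependencyOn()):

    void
    recordMultipleDependencies(const ObjectAddress *depender,
                               const ObjectAddress *referenced,
                               int nreferenced,
                               DependencyType behavior)
    {
        if (nreferenced == 0)
            return;                 /* nothing to record */

        if (nreferenced == 1)
        {
            /* single reference: skip the multi-insert machinery entirely */
            record_single_dependency(depender, referenced, behavior);
            return;
        }

        /* ... build the batch of pg_depend tuples and heap_multi_insert() it ... */
    }

With such a shortcut in place, callers would not need to special-case the single-reference situation themselves, which is what the quoted hunk above is doing.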
Re: Feature improvement: can we add queryId for pg_catalog.pg_stat_activity view?
On Mon, Aug 5, 2019 at 9:28 AM Kyotaro Horiguchi wrote: > > At Sun, 4 Aug 2019 00:04:01 -0700 (MST), legrand legrand > wrote in <1564902241482-0.p...@n3.nabble.com> > > > However having the nested queryid in > > > pg_stat_activity would be convenient to track > > > what is a long stored functions currently doing. > > > > +1 > > > > And this could permit to get wait event sampling per queryid when > > pg_stat_statements.track = all > > I'm strongly on this side emotionally, but also I'm on Tom and > Andres's side that exposing querid that way is not the right > thing. > > Doing that means we don't need exact correspondence between > top-level query and queryId (in nested or multistatement queries) > in this patch. pg_stat_statements will allow us to do the same > thing by having additional uint64[MaxBackends] array in > pgssSharedState, instead of expanding PgBackendStatus array in > core by the same size. Sure, but the problem with this approach is that all extensions that compute their own queryid would have to do the same. I hope that we can come up with an approach friendlier for those extensions.
Re: Remove HeapTuple and Buffer dependency for predicate locking functions
On Sat, Aug 3, 2019 at 11:56 AM Ashwin Agrawal wrote:
> On Wed, Jul 31, 2019 at 2:06 PM Andres Freund wrote:
>> I'm also a bit confused why we don't need to pass in the offset of the
>> current tuple, rather than the HOT root tuple here. That's not related
>> to this patch. But aren't we locking the wrong tuple here, in case of
>> HOT?
>
> Yes, root is being locked here instead of the HOT. But I don't have full
> context on the same. If we wish to fix it though, can be easily done now with
> the patch by passing "tid" instead of &(heapTuple->t_self).

Here are three relevant commits:

1. Commit dafaa3efb75 "Implement genuine serializable isolation level." (2011) locked the root tuple, because it set t_self to *tid. Possibly due to confusion about the effect of the preceding line ItemPointerSetOffsetNumber(tid, offnum).

2. Commit 81fbbfe3352 "Fix bugs in SSI tuple locking." (2013) fixed that by adding ItemPointerSetOffsetNumber(&heapTuple->t_self, offnum).

3. Commit b89e151054a "Introduce logical decoding." (2014) also did ItemPointerSet(&(heapTuple->t_self), BufferGetBlockNumber(buffer), offnum), for the benefit of historical MVCC snapshots (unnecessarily, considering the change in commit #2), but then, intending to "reset to original, non-redirected, tid", clobbered it, reintroducing the bug fixed by #2.

My first guess is that commit #3 above was developed before commit #2, and finished up clobbering it. In fact, both logical decoding and SSI want offnum, so we should be able to just remove the "reset" bit (perhaps like in the attached sketch, not really tested, though it passes). This must be in want of an isolation test, but I haven't yet tried to get my head around how to write a test that would show the difference.

--
Thomas Munro
https://enterprisedb.com

diff --git a/src/backend/access/heap/heapam.c b/src/backend/access/heap/heapam.c
index 94309949fa..107605d276 100644
--- a/src/backend/access/heap/heapam.c
+++ b/src/backend/access/heap/heapam.c
@@ -1576,7 +1576,6 @@ heap_hot_search_buffer(ItemPointer tid, Relation relation, Buffer buffer,
 		heapTuple->t_data = (HeapTupleHeader) PageGetItem(dp, lp);
 		heapTuple->t_len = ItemIdGetLength(lp);
 		heapTuple->t_tableOid = RelationGetRelid(relation);
-		ItemPointerSetOffsetNumber(&heapTuple->t_self, offnum);
 
 		/*
 		 * Shouldn't see a HEAP_ONLY tuple at chain start.
@@ -1608,15 +1607,14 @@ heap_hot_search_buffer(ItemPointer tid, Relation relation, Buffer buffer,
 			 * of the root tuple of the HOT chain. This is important because
 			 * the *Satisfies routine for historical mvcc snapshots needs the
 			 * correct tid to decide about the visibility in some cases.
+			 * SSI also needs offnum.
 			 */
-			ItemPointerSet(&(heapTuple->t_self), BufferGetBlockNumber(buffer), offnum);
+			ItemPointerSetOffsetNumber(&heapTuple->t_self, offnum);
 
 			/* If it's visible per the snapshot, we must return it */
 			valid = HeapTupleSatisfiesVisibility(heapTuple, snapshot, buffer);
 			CheckForSerializableConflictOut(valid, relation, heapTuple, buffer, snapshot);
-			/* reset to original, non-redirected, tid */
-			heapTuple->t_self = *tid;
 
 			if (valid)
 			{
Re: partition routing layering in nodeModifyTable.c
Hi Andres, Fujita-san, On Sat, Aug 3, 2019 at 3:01 AM Andres Freund wrote: > On 2019-07-31 17:04:38 +0900, Amit Langote wrote: > > I looked into trying to do the things I mentioned above and it seems > > to me that revising BeginDirectModify()'s API to receive the > > ResultRelInfo directly as Andres suggested might be the best way > > forward. I've implemented that in the attached 0001. Patches that > > were previously 0001, 0002, and 0003 are now 0002, 003, and 0004, > > respectively. 0002 is now a patch to "remove" > > es_result_relation_info. > > Thanks! Some minor quibbles aside, the non FDW patches look good to me. > > Fujita-san, do you have any comments on the FDW API change? Or anybody > else? Based on the discussion, I have updated the patch. > I'm a bit woried about the move of BeginDirectModify() into > nodeModifyTable.c - it just seems like an odd control flow to me. Not > allowing any intermittent nodes between ForeignScan and ModifyTable also > seems like an undesirable restriction for the future. I realize that we > already do that for BeginForeignModify() (just btw, that already accepts > resultRelInfo as a parameter, so being symmetrical for BeginDirectModify > makes sense), but it still seems like the wrong direction to me. > > The need for that move, I assume, comes from needing knowing the correct > ResultRelInfo, correct? I wonder if we shouldn't instead determine the > at plan time (in setrefs.c), somewhat similar to how we determine > ModifyTable.resultRelIndex. Doesn't look like that'd be too hard? The patch adds a resultRelIndex field to ForeignScan node, which is set to >= 0 value for non-SELECT queries. I first thought to set it only if direct modification is being used, but maybe it'd be simpler to set it even if direct modification is not used. To set it, the patch teaches set_plan_refs() to initialize resultRelIndex of ForeignScan plans that appear under ModifyTable. Fujita-san said he plans to revise the planning of direct-modification style queries to not require a ModifyTable node anymore, but maybe he'll just need to add similar code elsewhere but not outside setrefs.c. > Then we could just have BeginForeignModify, BeginDirectModify, > BeginForeignScan all be called from ExecInitForeignScan(). I too think that it would've been great if we could call both BeginForeignModify and BeginDirectModify from ExecInitForeignScan, but the former's API seems to be designed to be called from ExecInitModifyTable from the get-go. Maybe we should leave that as-is? > Path 04 is such a nice improvement. Besides getting rid of a substantial > amount of code, it also makes the control flow a lot easier to read. Thanks. > > @@ -4644,9 +4645,7 @@ GetAfterTriggersTableData(Oid relid, CmdType cmdType) > > * If there are no triggers in 'trigdesc' that request relevant transition > > * tables, then return NULL. > > * > > - * The resulting object can be passed to the ExecAR* functions. The caller > > - * should set tcs_map or tcs_original_insert_tuple as appropriate when > > dealing > > - * with child tables. > > + * The resulting object can be passed to the ExecAR* functions. 
> > * > > * Note that we copy the flags from a parent table into this struct (rather > > * than subsequently using the relation's TriggerDesc directly) so that we > > can > > @@ -5750,14 +5749,26 @@ AfterTriggerSaveEvent(EState *estate, ResultRelInfo > > *relinfo, > >*/ > > if (row_trigger && transition_capture != NULL) > > { > > - TupleTableSlot *original_insert_tuple = > > transition_capture->tcs_original_insert_tuple; > > - TupleConversionMap *map = transition_capture->tcs_map; > > + TupleTableSlot *original_insert_tuple; > > + PartitionRoutingInfo *pinfo = relinfo->ri_PartitionInfo; > > + TupleConversionMap *map = pinfo ? > > + > > pinfo->pi_PartitionToRootMap : > > + > > relinfo->ri_ChildToRootMap; > > booldelete_old_table = > > transition_capture->tcs_delete_old_table; > > boolupdate_old_table = > > transition_capture->tcs_update_old_table; > > boolupdate_new_table = > > transition_capture->tcs_update_new_table; > > boolinsert_new_table = > > transition_capture->tcs_insert_new_table; > > > > /* > > + * Get the originally inserted tuple from the global variable > > and set > > + * the latter to NULL because any given tuple must be read > > only once. > > + * Note that the TransitionCaptureState is shared across many > > calls > > + * to this function. > > + */ > > + original_insert_tuple = > > transition_capture->tcs_original_insert_tuple; > > + transition_capture->tcs_origin
Re: Index Skip Scan
Thanks for the new patch. I've reviewed the skip scan patch, but haven't taken a close look at the uniquekeys patch yet.

In my previous review I mentioned that queries of the form:

select distinct on(a) a,b from a where a=1;

do not lead to a skip scan with the patch, even though the skip scan would be much faster. It's about this piece of code in planner.c:

    /* Consider index skip scan as well */
    if (enable_indexskipscan &&
        IsA(path, IndexPath) &&
        ((IndexPath *) path)->indexinfo->amcanskip &&
        root->distinct_pathkeys != NIL)

The root->distinct_pathkeys is already filtered for redundant keys, so column 'a' is not in there anymore. Still, it'd be much faster to use the skip scan here, because a regular scan will scan all entries with a=1, even though we're really only interested in the first one. In previous versions, this would be fixed by changing the check in planner.c to use root->uniq_distinct_pathkeys instead of root->distinct_pathkeys, but things change a bit now that the patch is rebased on the unique-keys patch. Would it be valid to change this check to root->query_uniquekeys != NIL to consider skip scans also for this query?

Second is about the use of _bt_skip and _bt_read_closest in nbtsearch.c. I don't think _bt_read_closest is correctly implemented, and I'm not sure if it can be used at all, due to concerns by Tom and Peter about such an approach. I had a similar idea to only partially read items from a page for another use case, for which I submitted a patch last Friday. However, both Tom and Peter find this idea quite scary [1]. You could take a look at my patch on that thread to see the approach taken to correctly partially read a page (well, correct as far as I can see so far...), but perhaps we need to just use the regular _bt_readpage function which reads everything, although this is unfortunate from a performance point of view, since most of the time we're indeed just interested in the first tuple.

E.g. it looks like there are some mix-ups between index offsets and heap TIDs in _bt_read_closest:

    /*
     * Save the current item and the previous, even if the
     * latter does not pass scan key conditions
     */
    ItemPointerData tid = prevItup->t_tid;
    OffsetNumber prevOffnum = ItemPointerGetOffsetNumber(&tid);

    _bt_saveitem(so, itemIndex, prevOffnum, prevItup);
    itemIndex++;

    _bt_saveitem(so, itemIndex, offnum, itup);
    itemIndex++;

The 'prevOffnum' is the offset number for the heap TID, and not the index offset, so it looks like just a random item is saved. Furthermore, index offsets may change due to insertions and vacuums, so if we, at any point, release the lock, these offsets are not necessarily valid anymore. However, currently, the patch just reads the closest item and then doesn't consider this page at all anymore if the first tuple skipped to turns out to be not visible. Consider the following SQL to illustrate:

create table a (a int, b int, c int);
insert into a (select vs, ks, 10 from generate_series(1,5) vs, generate_series(1, 1) ks);
create index on a (a,b);
analyze a;

select distinct on (a) a,b from a order by a,b;
 a | b
---+---
 1 | 1
 2 | 1
 3 | 1
 4 | 1
 5 | 1
(5 rows)

delete from a where a=2 and b=1;
DELETE 1

select distinct on (a) a,b from a order by a,b;
 a | b
---+-----
 1 | 1
 2 | 249   ->> this should be b=2, because we deleted a=2 && b=1; however, it
              doesn't consider any tuples from that page anymore and gives us
              the first tuple from the next page.
 3 | 1
 4 | 1
 5 | 1
(5 rows)
-Floris [1] https://www.postgresql.org/message-id/flat/26641.1564778586%40sss.pgh.pa.us#dd8f23e1704f45447185894e1c2a4f2a
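For reference, a minimal sketch of the planner.c change questioned above, assuming the uniquekeys patch provides root->query_uniquekeys (this shows only the modified condition, not a tested patch):

    /* Consider index skip scan as well */
    if (enable_indexskipscan &&
        IsA(path, IndexPath) &&
        ((IndexPath *) path)->indexinfo->amcanskip &&
        root->query_uniquekeys != NIL)      /* was: root->distinct_pathkeys != NIL */
    {
        /* ... build the skip scan path as before ... */
    }

This would let "select distinct on(a) a,b from a where a=1" qualify even though its only DISTINCT pathkey was removed as redundant.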
Re: POC: Cleaning up orphaned files using undo logs
On Mon, Aug 5, 2019 at 12:09 PM Heikki Linnakangas wrote:
>
> On 05/08/2019 07:23, Thomas Munro wrote:
> > On Mon, Aug 5, 2019 at 3:54 PM Amit Kapila wrote:
> >> On Sun, Aug 4, 2019 at 2:46 PM Heikki Linnakangas wrote:
> >>> Could we leave out the UNDO and discard worker processes for now?
> >>> Execute all UNDO actions immediately at rollback, and after crash
> >>> recovery. That would be fine for cleaning up orphaned files,
> >>
> >> Even if we execute all the undo actions on rollback, we need the discard
> >> worker to discard undo at regular intervals. Also, what if we get an
> >> error while applying undo actions during rollback? Right now, we have
> >> a mechanism to push such a request to a background worker and allow the
> >> session to continue. Instead, we might want to PANIC in such cases if
> >> we don't want to have background undo workers.
> >>
> >>> and it
> >>> would cut down the size of the patch to review.
> >>
> >> If we can find some way to handle all cases and everyone agrees to it,
> >> that would be good. In fact, we can try to get the basic stuff
> >> committed first and then try to get the rest (undo-worker machinery)
> >> done.
> >
> > I think it's definitely worth exploring.
>
> Yeah. For cleaning up orphaned files, if unlink() fails, we can just log
> the error and move on. That's what we do in the main codepath, too. For
> any other error, PANIC seems ok. We're not expecting any errors during
> undo processing, so it doesn't seem safe to continue running.
>
> Hmm. Since applying the undo record is WAL-logged, you could run out of
> disk space while creating the WAL record. That seems unpleasant.
>

We might get away with doing some minimal error handling for the orphaned-file cleanup patch, but this facility was supposed to be a generic facility. Even assuming all of us agree on the error-handling stuff, I think we still might not be able to avoid the requirement for a discard worker to discard the logs.

> >>> Can this race condition happen: Transaction A creates a table and an
> >>> UNDO record to remember it. The transaction is rolled back, and the file
> >>> is removed. Another transaction, B, creates a different table, and
> >>> chooses the same relfilenode. It loads the table with data, and commits.
> >>> Then the system crashes. After crash recovery, the UNDO record for the
> >>> first transaction is applied, and it removes the file that belongs to
> >>> the second table, created by transaction B.
> >>
> >> I don't think such a race exists, but we should verify it once.
> >> Basically, once the rollback is complete, we mark the transaction
> >> rollback as complete in the transaction header in undo and write a WAL
> >> record for it. After crash recovery, we will skip such a transaction. Isn't
> >> that sufficient to prevent such a race condition?
>
> Ok, I didn't realize there's a flag in the undo record to mark it as
> applied. Yeah, that fixes it. Seems a bit heavy-weight, but I guess it's
> fine. Do you do something different in zheap? I presume writing a WAL
> record for every applied undo record would be too heavy there.
>

For zheap, we collect all the records of a page, apply them together and then write the entire page in WAL. The progress of a transaction is updated either at transaction end (rollback complete) or after processing some threshold of undo records. So, generally, the WAL won't be written for each individual undo record applied.

> This needs some performance testing. We're creating one extra WAL record
> and one UNDO record for every file creation, and another WAL record on
> abort. It's probably cheap compared to all the other work done during
> table creation, but we should still get some numbers on it.
>

Makes sense.

--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com
Re: Identity columns should own only one sequence
Peter Eisentraut wrote: > On 2019-05-08 16:49, Laurenz Albe wrote: > > I believe we should have both: > > > > - Identity columns should only use sequences with an INTERNAL dependency, > > as in Peter's patch. > > I have committed this. Thanks! > > - When a column default is dropped, remove all dependencies between the > > column and sequences. > > There is no proposed patch for this, AFAICT. There was one in https://www.postgresql.org/message-id/3916586ef7f33948235fe60f54a3750046f5d940.camel%40cybertec.at > So I have closed this commit fest item for now. That's fine with me. Yours, Laurenz Albe
Re: Removing unneeded self joins
On 02/08/2019 04:54, Thomas Munro wrote: On Thu, Jun 27, 2019 at 6:42 PM Andrey Lepikhov wrote: Version v.17 of the patch that fix the bug see in attachment. While moving this to the September CF, I noticed that it needs to be updated for the recent pg_list.h API changes. The patch was updated: 1. Changes caused by pg_list.h API changes. 2. Fix the problem of joint clause_relids and required_relids changes [1]. 3. Add eclass mentions of removed relation into the kept relation (field eclass_indexes was introduced by commit 3373c71553). [1] https://www.postgresql.org/message-id/flat/5c21029d-81a2-c999-6744-6a898fcc9a19%40postgrespro.ru -- Andrey Lepikhov Postgres Professional https://postgrespro.com The Russian Postgres Company >From 72ac38de7fb55fe741677ea75b280eecb443e978 Mon Sep 17 00:00:00 2001 From: "Andrey V. Lepikhov" Date: Fri, 2 Aug 2019 11:01:47 +0500 Subject: [PATCH] Remove self joins v18 --- src/backend/nodes/outfuncs.c | 19 +- src/backend/optimizer/path/indxpath.c | 28 +- src/backend/optimizer/path/joinpath.c |6 +- src/backend/optimizer/plan/analyzejoins.c | 1021 +++-- src/backend/optimizer/plan/planmain.c |7 +- src/backend/optimizer/util/pathnode.c |3 +- src/backend/optimizer/util/relnode.c | 26 +- src/include/nodes/nodes.h |1 + src/include/nodes/pathnodes.h | 22 +- src/include/optimizer/pathnode.h |4 + src/include/optimizer/paths.h |3 +- src/include/optimizer/planmain.h |7 +- src/test/regress/expected/equivclass.out | 32 + src/test/regress/expected/join.out| 256 +- src/test/regress/sql/equivclass.sql | 17 + src/test/regress/sql/join.sql | 127 +++ 16 files changed, 1480 insertions(+), 99 deletions(-) diff --git a/src/backend/nodes/outfuncs.c b/src/backend/nodes/outfuncs.c index 86c31a48c9..9d511d1d1b 100644 --- a/src/backend/nodes/outfuncs.c +++ b/src/backend/nodes/outfuncs.c @@ -2270,7 +2270,8 @@ _outRelOptInfo(StringInfo str, const RelOptInfo *node) WRITE_OID_FIELD(userid); WRITE_BOOL_FIELD(useridiscurrent); /* we don't try to print fdwroutine or fdw_private */ - /* can't print unique_for_rels/non_unique_for_rels; BMSes aren't Nodes */ + WRITE_NODE_FIELD(unique_for_rels); + /* can't print non_unique_for_rels; BMSes aren't Nodes */ WRITE_NODE_FIELD(baserestrictinfo); WRITE_UINT_FIELD(baserestrict_min_security); WRITE_NODE_FIELD(joininfo); @@ -2342,6 +2343,19 @@ _outStatisticExtInfo(StringInfo str, const StatisticExtInfo *node) WRITE_BITMAPSET_FIELD(keys); } +static void +_outUniqueRelInfo(StringInfo str, const UniqueRelInfo *node) +{ + WRITE_NODE_TYPE("UNIQUERELINFO"); + + WRITE_BITMAPSET_FIELD(outerrelids); + if (node->index) + { + WRITE_OID_FIELD(index->indexoid); + WRITE_NODE_FIELD(column_values); + } +} + static void _outEquivalenceClass(StringInfo str, const EquivalenceClass *node) { @@ -4108,6 +4122,9 @@ outNode(StringInfo str, const void *obj) case T_StatisticExtInfo: _outStatisticExtInfo(str, obj); break; + case T_UniqueRelInfo: +_outUniqueRelInfo(str, obj); +break; case T_ExtensibleNode: _outExtensibleNode(str, obj); break; diff --git a/src/backend/optimizer/path/indxpath.c b/src/backend/optimizer/path/indxpath.c index 5f339fdfde..b57be54df6 100644 --- a/src/backend/optimizer/path/indxpath.c +++ b/src/backend/optimizer/path/indxpath.c @@ -3568,7 +3568,8 @@ ec_member_matches_indexcol(PlannerInfo *root, RelOptInfo *rel, * relation_has_unique_index_for * Determine whether the relation provably has at most one row satisfying * a set of equality conditions, because the conditions constrain all - * columns of some unique index. + * columns of some unique index. 
If index_info is not null, it is set to + * point to a new UniqueRelInfo containing the index and conditions. * * The conditions can be represented in either or both of two ways: * 1. A list of RestrictInfo nodes, where the caller has already determined @@ -3589,12 +3590,16 @@ ec_member_matches_indexcol(PlannerInfo *root, RelOptInfo *rel, bool relation_has_unique_index_for(PlannerInfo *root, RelOptInfo *rel, List *restrictlist, - List *exprlist, List *oprlist) + List *exprlist, List *oprlist, + UniqueRelInfo **unique_info) { ListCell *ic; Assert(list_length(exprlist) == list_length(oprlist)); + if (unique_info) + *unique_info = NULL; + /* Short-circuit if no indexes... */ if (rel->indexlist == NIL) return false; @@ -3645,6 +3650,7 @@ relation_has_unique_index_for(PlannerInfo *root, RelOptInfo *rel, { IndexOptInfo *ind = (IndexOptInfo *) lfirst(ic); int c; + List *column_values = NIL; /* * If the index is not unique, or not immediately enforced, or if it's @@ -3693,6 +3699,9 @@ relation_has_unique_index_for(PlannerInfo *root, RelOptInfo *rel, if (match_index_to_op
Re: Identity columns should own only one sequence
Peter Eisentraut wrote: > On 2019-05-08 16:49, Laurenz Albe wrote: > > I believe we should have both: > > > > - Identity columns should only use sequences with an INTERNAL dependency, > > as in Peter's patch. > > I have committed this. Since this is a bug fix, shouldn't it be backpatched? Yours, Laurenz Albe
Re: Optimize single tuple fetch from nbtree index
Hi Peter,

> Actually, having looked at the test case in more detail, that now
> seems less likely. The test case seems designed to reward making it
> cheaper to access one specific tuple among a fairly large group of
> related tuples -- reducing the number of inner scans is not going to
> be possible there.

> If this really is totally representative of the case that Floris cares
> about, I suppose that the approach taken more or less makes sense.
> Unfortunately, it doesn't seem like an optimization that many other
> users would find compelling, partly because it's only concerned with
> fixed overheads, and partly because most queries don't actually look
> like this.

Thanks for taking a look. Unfortunately this is exactly the case I care about. I'm a bit puzzled as to why this case wouldn't come up more often for other users though. We have many large tables with timeseries data, and it seems to me that with timeseries data two of the most common queries are:

(1) What is the state of { a1, a2, a3 ... } at timepoint t (but you don't know that there's an update *exactly* at timepoint t - so you're left with trying to find the latest update smaller than t)

(2) What is the state of { a } at timepoints { t1, t2, t3 ... }

Given that a1, a2, a3 ... are independently updating, but similar, time series (e.g. sensor a1 and sensor a2 both provide a temperature value but update independently from each other). Both of these can also be done with some kind of DISTINCT ON clause, but this is often already much slower than just doing a nested loop of fast index lookups with LIMIT 1 (this depends on the frequency of the timeseries data itself versus the sampling rate of your query though; for high-frequency time series and/or low-frequency sampling the LIMIT 1 approach is much faster).

Note that there is actually some related work to this - in the Index Skip Scan thread [1] a function called _bt_read_closest was developed which also partially reads the page. A Skip Scan has a very similar access pattern to the use case I describe here, because it's also very likely to just require one tuple from the page. Even though the implementation in that patch is currently incorrect, performance of the Skip Scan would likely also be quite a bit faster if it had a correct implementation of this partial page read and it wouldn't have to read the full page every time.

I have one further question about these index offsets. There are several comments in master that indicate that it's impossible for an item to move 'left' on a page if we continuously hold a pin on the page. For example, _bt_killitems has a comment like this:

 * Note that if we hold a pin on the target page continuously from initially
 * reading the items until applying this function, VACUUM cannot have deleted
 * any items from the page, and so there is no need to search left from the
 * recorded offset. (This observation also guarantees that the item is still
 * the right one to delete, which might otherwise be questionable since heap
 * TIDs can get recycled.) This holds true even if the page has been modified
 * by inserts and page splits, so there is no need to consult the LSN.

Still, exactly this case happens in practice.
In my tests I was able to get behavior like:

1) pin + lock a page in _bt_first
2) read a tuple, record the indexOffset (for example offset=100) and heap TID
3) unlock the page, but *keep* the pin (end of _bt_first in my patch)
4) lock the page again in _bt_next (we still hold the pin, so vacuum shouldn't have occurred)
5) look inside the current page for the heap TID that we registered earlier
6) find that this tuple is now at indexOffset=98, i.e. it moved left

This should not be possible. This case sometimes randomly happens when running 'make check', which is why I added code in my patch to also look left on the page from the previous indexOffset. However, this is in contradiction with the comments (and code) of _bt_killitems. Is the comment incorrect/outdated, or is there a bug in vacuum or any other part of Postgres that might move index items left even though there are others holding a pin?

-Floris

[1] https://www.postgresql.org/message-id/flat/CA%2BhUKGKW4dXTP9G%2BWBskjT09tzD%2B9aMWEm%3DFpeb6RS5SXfPyKw%40mail.gmail.com#21abe755d5cf36aabaaa048c8a282169
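For what it's worth, a rough sketch of the kind of leftward re-search described above, assuming a still-pinned and re-locked leaf page (this is purely illustrative and not the actual patch code):

    #include "postgres.h"
    #include "access/nbtree.h"
    #include "storage/bufpage.h"

    static OffsetNumber
    refind_saved_item(Page page, OffsetNumber prev_offnum, ItemPointer saved_htid)
    {
        BTPageOpaque    opaque = (BTPageOpaque) PageGetSpecialPointer(page);
        OffsetNumber    minoff = P_FIRSTDATAKEY(opaque);
        OffsetNumber    start = Min(prev_offnum, PageGetMaxOffsetNumber(page));
        OffsetNumber    offnum;

        /* walk left from the remembered offset until the saved heap TID is found */
        for (offnum = start; offnum >= minoff; offnum = OffsetNumberPrev(offnum))
        {
            ItemId      itemid = PageGetItemId(page, offnum);
            IndexTuple  itup = (IndexTuple) PageGetItem(page, itemid);

            if (ItemPointerEquals(&itup->t_tid, saved_htid))
                return offnum;      /* may be left of prev_offnum, as observed */
        }

        return InvalidOffsetNumber; /* not found on this page */
    }

Whether such a re-search should ever be necessary while the pin is held continuously is exactly the question raised above.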
psql's meta command \d is broken as of 12 beta2.
Hi,

Sorry if this report is a duplicate, but there is no column relhasoids in pg_catalog.pg_class, which is required by the underlying query of the \d meta-command of psql.
Re: psql's meta command \d is broken as of 12 beta2.
Hi po 5. 8. 2019 v 13:35 odesílatel Dmitry Igrishin napsal: > Hi, > > Sorry if this report is duplicate but there is no column relhasoids in > pg_catalog.pg_class required by the underlying query of the \d meta > command of psql. > do you use psql from this release? The psql client should be higher or same like server. Pavel
Re: psql's meta command \d is broken as of 12 beta2.
пн, 5 авг. 2019 г. в 14:42, Pavel Stehule : > > Hi > > > po 5. 8. 2019 v 13:35 odesílatel Dmitry Igrishin napsal: >> >> Hi, >> >> Sorry if this report is duplicate but there is no column relhasoids in >> pg_catalog.pg_class required by the underlying query of the \d meta >> command of psql. > > > do you use psql from this release? > > The psql client should be higher or same like server. > > Pavel Oops, I'm wrong. When I looked at describe.c I understood my mistake. Sorry for noise and thank you!
Re: psql's meta command \d is broken as of 12 beta2.
po 5. 8. 2019 v 13:46 odesílatel Dmitry Igrishin napsal: > пн, 5 авг. 2019 г. в 14:42, Pavel Stehule : > > > > Hi > > > > > > po 5. 8. 2019 v 13:35 odesílatel Dmitry Igrishin > napsal: > >> > >> Hi, > >> > >> Sorry if this report is duplicate but there is no column relhasoids in > >> pg_catalog.pg_class required by the underlying query of the \d meta > >> command of psql. > > > > > > do you use psql from this release? > > > > The psql client should be higher or same like server. > > > > Pavel > Oops, I'm wrong. When I looked at describe.c I understood my mistake. > Sorry for noise and thank you! > no problem Pavel
Re: POC: Cleaning up orphaned files using undo logs
On Mon, Aug 5, 2019 at 6:16 AM Amit Kapila wrote: > For zheap, we collect all the records of a page, apply them together > and then write the entire page in WAL. The progress of transaction is > updated at either transaction end (rollback complete) or after > processing some threshold of undo records. So, generally, the WAL > won't be for each undo record apply. This explanation omits a crucial piece of the mechanism, because Heikki is asking what keeps the undo from being applied multiple times. When we apply the undo records to a page, we also adjust the undo pointers in the page. Since we have an undo pointer per transaction slot, and each transaction has its own slot, if we apply all the undo for a transaction to a page, we can just clear the slot; if we somehow end up back at the same point later, we'll know not to apply the undo a second time because we'll see that there's no transaction slot pointing to the undo we were thinking of applying. If we roll back to a savepoint, or for some other reason choose to apply only some of the undo to a page, we can set the undo record pointer for the transaction back to the value it had before we generated any newer undo. Then, we'll know that the newer undo doesn't need to be applied but the older undo can be applied. At least, I think that's how it's supposed to work. If you just update the progress field, it doesn't guarantee anything, because in the event of a crash, we could end up keeping the page changes but losing the update to the progress, as they are part of separate undo records. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
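To make the idempotence argument concrete, here is a pseudo-code sketch of the check described above; every name in it (find_transaction_slot, slot_undo_ptr, and so on) is a hypothetical placeholder used for illustration, not an API from the patch or from zheap:

    /*
     * Decide whether an undo record may still be applied to a page.
     * Illustrative only; all helper functions are hypothetical.
     */
    static bool
    undo_still_applies(Page page, TransactionId xid, UndoRecPtr urecptr)
    {
        int     slot = find_transaction_slot(page, xid);

        /* Slot already cleared: all undo for xid was applied to this page. */
        if (slot == -1)
            return false;

        /*
         * The slot's undo pointer was already rewound past this record (for
         * example by a completed rollback to savepoint), so this newer
         * record must not be applied again.
         */
        if (urecptr > slot_undo_ptr(page, slot))
            return false;

        return true;
    }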
Re: POC: Cleaning up orphaned files using undo logs
On Sun, Aug 4, 2019 at 5:16 AM Heikki Linnakangas wrote:
> I feel that the level of abstraction is not quite right. There are a
> bunch of fields, like uur_block, uur_offset, uur_tuple, that are
> probably useful for some UNDO resource managers (zheap I presume), but
> seem kind of arbitrary. How is uur_tuple different from uur_payload?
> Should they be named more generically as uur_payload1 and uur_payload2?
> And why two, why not three or four different payloads? In the WAL record
> format, there's a concept of "block id", which allows you to store N
> number of different payloads in the record, I think that would be a
> better approach. Or only have one payload, and let the resource manager
> code divide it as it sees fit.
>
> Many of the fields support a primitive type of compression, where a
> field can be omitted if it has the same value as on the first record on
> an UNDO page. That's handy. But again I don't like the fact that the
> fields have been hard-coded into the UNDO record format. I can see e.g.
> the relation oid to be useful for many AMs. But not all. And other AMs
> might well want to store and deduplicate other things, aside from the
> fields that are in the patch now. I'd like to move most of the fields to
> AM specific code, and somehow generalize the compression. One approach
> would be to let the AM store an arbitrary struct, and run it through a
> general-purpose compression algorithm, using the UNDO page's first
> record as the "dictionary".

I thought about this, too. I agree that there's something a little unsatisfying about the current structure, but I haven't been able to come up with something that seems definitively better. I think something along the lines of what you are describing here might work well, but I am VERY doubtful about the idea of a fixed-size struct. I think AMs are going to want to store variable-length data: especially tuples, but maybe also other stuff. For instance, imagine some AM that wants to implement locking that's more fine-grained than the four levels of tuple locks we have today: instead of just having key locks and all-columns locks, you could want to store the exact columns to be locked. Or maybe your TIDs are variable-width.

And the problem is that as soon as you move to something where you pack in a bunch of variable-sized fields, you lose the ability to refer to things using reasonable names. That's where I came up with the idea of an UnpackedUndoRecord: give the common fields that "everyone's going to need" human-readable names, and jam only the strange, AM-specific stuff into the payload. But if those needs are not actually universal but very much AM-specific, then I'm afraid we're going to end up with deeply inscrutable code for packing and unpacking records. I imagine it's possible to come up with a good structure for that, but I don't think we have one today.

> I don't like the way UndoFetchRecord returns a palloc'd
> UnpackedUndoRecord. I would prefer something similar to the xlogreader
> API, where a new call to UndoFetchRecord invalidates the previous
> result. On efficiency grounds, to avoid the palloc, but also to be
> consistent with xlogreader.

I don't think that's going to work very well, because we often need to deal with multiple records at a time.
There is (or was) a bulk-fetch interface, but I've also found while experimenting with this code that it can be useful to do things like:

    current = undo_fetch(starting_record);
    loop:
        next = undo_fetch(current->next_record_ptr);
        if some_test(next):
            break;
        undo_free(current);
        current = next;

I think we shouldn't view such cases as exceptions to the general paradigm of looking at undo records one at a time, but instead as the normal case for which everything is optimized. Cases like orphaned file cleanup where the number of undo records is probably small and they're all independent of each other will, I think, turn out to be the exception rather than the rule.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Re: block-level incremental backup
On Fri, Aug 2, 2019 at 9:13 AM vignesh C wrote: > + rc = system(copycmd); I don't think this patch should be calling system() in the first place. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: Problem with default partition pruning
On 2019-Aug-05, yuzuko wrote:

> However, I'm still concerned that the block
> -
> ...
> -
> is written in the right place as Amit explained [1].

Yeah, I have that patch installed locally and I'm looking into it. Thanks for the reminder. I also have an eye on Horiguchi's patch elsewhere in the thread.

--
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Re: SSL Connection still showing TLSv1.3 even though it is disabled in ssl_ciphers
tushar writes:
> when I connect to the psql terminal -
> psql.bin (10.9)
> SSL connection (protocol: TLSv1.3, cipher: TLS_AES_256_GCM_SHA384,
> bits: 256, compression: off)
> Type "help" for help.
> postgres=# show ssl_ciphers ;
>                  ssl_ciphers
> ----------------------------------------------
>  TLSv1.2:!aNULL:!SSLv2:!SSLv3:!TLSv1:!TLSv1.3
> (1 row)

My guess is that OpenSSL ignored your ssl_ciphers setting on the grounds that it's stupid to reject all possible ciphers. In any case, this would be something to raise with them, not us. PG does nothing with that value except pass it to SSL_CTX_set_cipher_list().

regards, tom lane
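A minimal sketch of the call path Tom describes, with one caveat that is a likely explanation rather than something established in this thread: in OpenSSL 1.1.1, SSL_CTX_set_cipher_list() only governs the TLSv1.2-and-older cipher list, while TLSv1.3 cipher suites are configured separately through SSL_CTX_set_ciphersuites(), which would explain why the "!TLSv1.3" entry in ssl_ciphers has no visible effect:

    #include <openssl/ssl.h>

    /* Hand the ssl_ciphers GUC string straight to OpenSSL, as the server does. */
    static int
    apply_ssl_ciphers(SSL_CTX *context, const char *ssl_ciphers)
    {
        /*
         * Affects TLSv1.2-and-older cipher selection only; TLSv1.3 suites
         * would be controlled separately, e.g.:
         *     SSL_CTX_set_ciphersuites(context, "TLS_AES_256_GCM_SHA384");
         * Returns 1 on success, 0 if no cipher could be selected.
         */
        return SSL_CTX_set_cipher_list(context, ssl_ciphers);
    }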
Re: [PATCH] Stop ALTER SYSTEM from making bad assumptions
Greetings,

* Tomas Vondra (tomas.von...@2ndquadrant.com) wrote:
> On Fri, Aug 02, 2019 at 06:38:46PM -0400, Stephen Frost wrote:
> >On Fri, Aug 2, 2019 at 18:27 Tom Lane wrote:
> >>Tomas Vondra writes:
> >>> There seems to be a consensus that this is not a pg_basebackup issue
> >>> (i.e. duplicate values don't make the file invalid), and it should be
> >>> handled in ALTER SYSTEM.
> >>
> >>Yeah. I doubt pg_basebackup is the only actor that can create such
> >>situations.
> >>
> >>> The proposal seems to be to run through the .auto.conf file, remove any
> >>> duplicates, and append the new entry at the end. That seems reasonable.
> >>
> >>+1
> >
> >I disagree that this should only be addressed in alter system, as I’ve said
> >before and as others have agreed with. Having one set of code that can be
> >used to update parameters in the auto.conf and then have that be used by
> >pg_basebackup, alter system, and external tools, is the right approach.
> >
> >The idea that alter system should be the only thing that doesn’t just
> >append changes to the file is just going to lead to confusion and bugs down
> >the road.
>
> I don't remember any suggestions ALTER SYSTEM should be the only thing
> that can rewrite the config file, but maybe it's buried somewhere in the
> thread history. The current proposal certainly does not prohibit any
> external tool from doing so, it just says we should expect duplicates.

There's an ongoing assumption that's been made that only ALTER SYSTEM could make these changes because nothing else has the full GUC system and a running PG instance to validate everything. The suggestion that an external tool could do it goes against that.

If we can't, for whatever reason, work our way towards having code that external tools could leverage to manage .auto.conf, then if we could at least document what the expectations are and what tools can/can't do with the file, that would put us in a better position than where we are now.

I strongly believe that whatever the rules and expectations are that we come up with, both ALTER SYSTEM and the in-core and external tools should follow them.

If we say that tools should expect duplicates in the file, then ALTER SYSTEM should as well, which was the whole issue in the first place- ALTER SYSTEM didn't expect duplicates, but the external tools and the GUC system did.

If we say that it's acceptable for something to remove duplicate GUC entries from the file, keeping the last one, then external tools should feel comfortable doing that too, and we should make it clear what "duplicate" means here and how to identify one.

If we say it's optional for a tool to remove duplicates, then we should point out the risk of "running out of disk space" for tool authors to consider. I don't agree with the idea that tool authors should be asked to depend on someone running ALTER SYSTEM to address that risk. If there's a strong feeling that tool authors should be able to depend on PG to perform that cleanup for them, then we should use a mechanism to do so which doesn't involve an entirely optional feature.

For reference, all of the above, while not as cleanly as it could have been, was addressed with the recovery.conf/recovery.done system. Tool authors had a good sense that they could replace that file, and that PG would clean it up at exactly the right moment, and there wasn't this ugly interaction with ALTER SYSTEM to have to worry about. That none of this was really even discussed or addressed previously, even after being pointed out, is really disappointing.

Just to be clear, I brought up this exact concern back in *November*:

https://www.postgresql.org/message-id/20181127153405.GX3415%40tamriel.snowman.net

And now, because this was pushed forward and the concerns that I raised were ignored, we're having to deal with this towards the end of the release cycle instead of during normal development. The things we're talking about now, and which I'm getting push-back on because of the release cycle situation, were specifically suggestions I made in the above email in November, where I also brought up the concern that ALTER SYSTEM would be confused by the duplicates: giving external tools guidance on how to modify .auto.conf, or providing them a tool (or library), or both. None of this should be coming as a surprise to anyone who was following, and I feel we should be upset that this was left to such a late point in the release cycle to address these issues.

> >>> There was a discussion whether to print warnings about the duplicates. I
> >>> personally see not much point in doing that - if we consider duplicates
> >>> to be expected, and if ALTER SYSTEM has the license to rework the config
> >>> file any way it wants, why warn about it?
> >>
> >>Personally I agree that warnings are unnecessary.
> >
> >And at least Magnus and I disagree with that, as I recall from this
> >thread. Let’s have a clean and clear way to modify the auto.conf and have
> >everything that touch
Re: block-level incremental backup
Greetings, * Robert Haas (robertmh...@gmail.com) wrote: > On Fri, Aug 2, 2019 at 9:13 AM vignesh C wrote: > > + rc = system(copycmd); > > I don't think this patch should be calling system() in the first place. +1. Thanks, Stephen signature.asc Description: PGP signature
Re: [PATCH] Stop ALTER SYSTEM from making bad assumptions
Here's a radical suggestion: replace postgresql.auto.conf with a directory containing multiple files. Each file is named after a configuration parameter, and its content is the value of the parameter. So to remove a special configuration parameter, delete its file. To set it, write the file, replacing an existing file if it exists. For this to work unambiguously we would have to specify an exact, case-sensitive, form of every parameter name that must be used within the auto conf directory. I would suggest using the form listed in the documentation (i.e., lower case, to my knowledge). In order to prevent confusing and surprising behaviour, the system should complain vociferously if it finds a configuration parameter file that is not named after a defined parameter, rather than just ignoring it. On Mon, 5 Aug 2019 at 10:21, Stephen Frost wrote: > Greetings, > > * Tomas Vondra (tomas.von...@2ndquadrant.com) wrote: > > On Fri, Aug 02, 2019 at 06:38:46PM -0400, Stephen Frost wrote: > > >On Fri, Aug 2, 2019 at 18:27 Tom Lane wrote: > > >>Tomas Vondra writes: > > >>> There seems to be a consensus that this this not a pg_basebackup > issue > > >>> (i.e. duplicate values don't make the file invalid), and it should be > > >>> handled in ALTER SYSTEM. > > >> > > >>Yeah. I doubt pg_basebackup is the only actor that can create such > > >>situations. > > >> > > >>> The proposal seems to be to run through the .auto.conf file, remove > any > > >>> duplicates, and append the new entry at the end. That seems > reasonable. > > >> > > >>+1 > > > > > >I disagree that this should only be addressed in alter system, as I’ve > said > > >before and as others have agreed with. Having one set of code that can > be > > >used to update parameters in the auto.conf and then have that be used by > > >pg_basebackup, alter system, and external tools, is the right approach. > > > > > >The idea that alter system should be the only thing that doesn’t just > > >append changes to the file is just going to lead to confusion and bugs > down > > >the road. > > > > I don't remember any suggestions ALTER SYSTEM should be the only thing > > that can rewrite the config file, but maybe it's buried somewhere in the > > thread history. The current proposal certainly does not prohibit any > > external tool from doing so, it just says we should expect duplicates. > > There's an ongoing assumption that's been made that only ALTER SYSTEM > could make these changes because nothing else has the full GUC system > and a running PG instance to validate everything. > > The suggestion that an external tool could do it goes against that. > > If we can't, for whatever reason, work our way towards having code that > external tools could leverage to manage .auto.conf, then if we could at > least document what the expectations are and what tools can/can't do > with the file, that would put us in a better position than where we are > now. > > I strongly believe that whatever the rules and expectations are that we > come up with, both ALTER SYSTEM and the in-core and external tools > should follow them. > > If we say to that tools should expect duplicates in the file, then > ALTER SYSTEM should as well, which was the whole issue in the first > place- ALTER SYSTEM didn't expect duplicates, but the external tools and > the GUC system did. 
> > If we say that it's acceptable for something to remove duplicate GUC > entries from the file, keeping the last one, then external tools should > feel comfortable doing that too and we should make it clear what > "duplicate" means here and how to identify one. > > If we say it's optional for a tool to remove duplicates, then we should > point out the risk of "running out of disk space" for tool authors to > consider. I don't agree with the idea that tool authors should be asked > to depend on someone running ALTER SYSTEM to address that risk. If > there's a strong feeling that tool authors should be able to depend on > PG to perform that cleanup for them, then we should use a mechanism to > do so which doesn't involve an entirely optional feature. > > For reference, all of the above, while not as cleanly as it could have > been, was addressed with the recovery.conf/recovery.done system. Tool > authors had a good sense that they could replace that file, and that PG > would clean it up at exactly the right moment, and there wasn't this > ugly interaction with ALTER SYSTEM to have to worry about. That none of > this was really even discussed or addressed previously even after being > pointed out is really disappointing. > > Just to be clear, I brought up this exact concern back in *November*: > > > https://www.postgresql.org/message-id/20181127153405.GX3415%40tamriel.snowman.net > > And now because this was pushed forward and the concerns that I raised > ignored, we're having to deal with this towards the end of the release > cycle instead of during normal development. The things we're talking > about now and which I'm gettin
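A minimal sketch of how a server-side loader for the per-parameter configuration directory proposed at the top of this message could look. The directory layout, the assign callback, and every name here are assumptions made up for illustration, not an existing PostgreSQL feature; a real loader would also have to verify that each file name is a known, canonically spelled parameter, as suggested above.

#include <dirent.h>
#include <stdio.h>
#include <string.h>

/* Stand-in for whatever the server would do with each (name, value) pair. */
typedef void (*assign_hook) (const char *name, const char *value);

/*
 * Read a hypothetical configuration directory in which every file is named
 * after exactly one parameter and contains only that parameter's value.
 * Removing a setting is then just unlinking its file; setting it is an
 * atomic file replace.
 */
static void
read_auto_conf_dir(const char *dirpath, assign_hook assign)
{
	DIR		   *dir = opendir(dirpath);
	struct dirent *de;

	if (dir == NULL)
		return;					/* a real loader would report this error */

	while ((de = readdir(dir)) != NULL)
	{
		char		path[1024];
		char		value[1024];
		FILE	   *fp;

		if (de->d_name[0] == '.')
			continue;			/* skip ".", ".." and hidden files */

		snprintf(path, sizeof(path), "%s/%s", dirpath, de->d_name);
		fp = fopen(path, "r");
		if (fp == NULL)
			continue;
		if (fgets(value, sizeof(value), fp) != NULL)
		{
			value[strcspn(value, "\n")] = '\0';	/* drop trailing newline */
			assign(de->d_name, value);			/* file name is the GUC name */
		}
		fclose(fp);
	}
	closedir(dir);
}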
Re: Problem with default partition pruning
On 2019-Aug-05, yuzuko wrote: > So I proposed moving the if() block to the current place. > The latest patch can solve both queries but I found the latter > problem can be solved by setting constraint_exclusion = on. So we have three locations for that test: one is where it currently is, which handles a small subset of the cases; another is where Amit first proposed putting it, which handles some additional cases; and the third is where your latest patch puts it, which seems to handle all cases. Isn't that what Amit is saying? If that's correct (and that's what I want to imply with the comment changes I proposed), then we should just accept that version of the patch. I don't think that we care about what happens when constraint_exclusion is on. That's not the recommended value for that setting anyway. -- Álvaro Herrera https://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Re: [PATCH] Stop ALTER SYSTEM from making bad assumptions
On Mon, Aug 05, 2019 at 10:21:39AM -0400, Stephen Frost wrote: Greetings, * Tomas Vondra (tomas.von...@2ndquadrant.com) wrote: On Fri, Aug 02, 2019 at 06:38:46PM -0400, Stephen Frost wrote: >On Fri, Aug 2, 2019 at 18:27 Tom Lane wrote: >>Tomas Vondra writes: >>> There seems to be a consensus that this this not a pg_basebackup issue >>> (i.e. duplicate values don't make the file invalid), and it should be >>> handled in ALTER SYSTEM. >> >>Yeah. I doubt pg_basebackup is the only actor that can create such >>situations. >> >>> The proposal seems to be to run through the .auto.conf file, remove any >>> duplicates, and append the new entry at the end. That seems reasonable. >> >>+1 > >I disagree that this should only be addressed in alter system, as I’ve said >before and as others have agreed with. Having one set of code that can be >used to update parameters in the auto.conf and then have that be used by >pg_basebackup, alter system, and external tools, is the right approach. > >The idea that alter system should be the only thing that doesn’t just >append changes to the file is just going to lead to confusion and bugs down >the road. I don't remember any suggestions ALTER SYSTEM should be the only thing that can rewrite the config file, but maybe it's buried somewhere in the thread history. The current proposal certainly does not prohibit any external tool from doing so, it just says we should expect duplicates. There's an ongoing assumption that's been made that only ALTER SYSTEM could make these changes because nothing else has the full GUC system and a running PG instance to validate everything. The suggestion that an external tool could do it goes against that. If we can't, for whatever reason, work our way towards having code that external tools could leverage to manage .auto.conf, then if we could at least document what the expectations are and what tools can/can't do with the file, that would put us in a better position than where we are now. IMO documenting the basic rules, and then doing some cleanup/validation at instance start is the only practical solution, really. You can't really validate "everything" without a running instance, because that's the only place where you have GUCs defined by extensions. I don't see how that could work for external tools, expected to run exactly when the instance is not running. I can't think of a use case where simply appending to the file would not be perfectly sufficient. You can't really do much when the instance is not running. I strongly believe that whatever the rules and expectations are that we come up with, both ALTER SYSTEM and the in-core and external tools should follow them. I'm not against giving external tools such capability, in whatever way we think is appropriate (library, command-line binary, ...). I'm against (a) making that a requirement for the external tools, instead of just allowing them to append to the file, and (b) trying to do that in PG12. We're at beta3, we don't even have any patch, and it does quite work for past releases (although it's not that pressing there, thanks to still having recovery.conf). If we say to that tools should expect duplicates in the file, then ALTER SYSTEM should as well, which was the whole issue in the first place- ALTER SYSTEM didn't expect duplicates, but the external tools and the GUC system did. Sure. 
If we say that it's acceptable for something to remove duplicate GUC entries from the file, keeping the last one, then external tools should feel comfortable doing that too and we should make it clear what "duplicate" means here and how to identify one. Sure. I don't see why the external tools would bother with doing that, but I agree there's no reason not to document what duplicates mean. If we say it's optional for a tool to remove duplicates, then we should point out the risk of "running out of disk space" for tool authors to consider. I don't agree with the idea that tool authors should be asked to depend on someone running ALTER SYSTEM to address that risk. If there's a strong feeling that tool authors should be able to depend on PG to perform that cleanup for them, then we should use a mechanism to do so which doesn't involve an entirely optional feature. Considering the external tools are only allowed to modify the file while the instance is not running, and that most instances are running all the time, I very much doubt this is a risk we need to worry about. And I don't see why we'd have to run ALTER SYSTEM - I proposed to do the cleanup at instance start too. For reference, all of the above, while not as cleanly as it could have been, was addressed with the recovery.conf/recovery.done system. Tool authors had a good sense that they could replace that file, and that PG would clean it up at exactly the right moment, and there wasn't this ugly interaction with ALTER SYSTEM to have to worry about. That none of this was really even discussed or addressed previ
Re: [PATCH] Stop ALTER SYSTEM from making bad assumptions
Tomas Vondra writes: > On Mon, Aug 05, 2019 at 10:21:39AM -0400, Stephen Frost wrote: >> I'd be happier with one set of code at least being the recommended >> approach to modifying the file and only one set of code in our codebase >> that's messing with .auto.conf, so that, hopefully, it's done >> consistently and properly and in a way that everything agrees on and >> expects, but if we can't get there due to concerns about where we are in >> the release cycle, et al, then let's at least document what is >> *supposed* to happen and have our code do so. > I think fixing ALTER SYSTEM to handle duplicities, and documenting the > basic rules about modifying .auto.conf is the way to go. I agree. So the problem at the moment is we lack a documentation patch. Who wants to write it? regards, tom lane
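To make the intended semantics concrete, here is a simplified, standalone model of the "last duplicate wins" cleanup being discussed. It deliberately uses its own toy struct rather than guc.c's real ConfigVariable list, so treat it as an illustration of the rule the code and documentation should describe, not as the eventual implementation.

#include <stdio.h>
#include <string.h>
#include <strings.h>

/* Simplified stand-in for a parsed "name = value" line. */
typedef struct ConfLine
{
	const char *name;
	const char *value;
	int			keep;			/* cleared if a later duplicate wins */
} ConfLine;

/*
 * Mark all but the last occurrence of each parameter as dropped.
 * "Duplicate" here means a case-insensitive match on the parameter name,
 * which is how GUC names are compared.
 */
static void
drop_earlier_duplicates(ConfLine *lines, int nlines)
{
	for (int i = 0; i < nlines; i++)
	{
		lines[i].keep = 1;
		for (int j = i + 1; j < nlines; j++)
		{
			if (strcasecmp(lines[i].name, lines[j].name) == 0)
			{
				lines[i].keep = 0;	/* a later entry overrides this one */
				break;
			}
		}
	}
}

int
main(void)
{
	ConfLine	lines[] = {
		{"primary_conninfo", "host=old", 0},
		{"wal_level", "replica", 0},
		{"primary_conninfo", "host=new", 0},	/* appended later, wins */
	};
	int			n = sizeof(lines) / sizeof(lines[0]);

	drop_earlier_duplicates(lines, n);
	for (int i = 0; i < n; i++)
		if (lines[i].keep)
			printf("%s = '%s'\n", lines[i].name, lines[i].value);
	return 0;
}

Rewriting the file with only the kept lines, then appending the newly set value, matches the proposal quoted above: run through the file, drop duplicates, and append the new entry at the end.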
Re: [PATCH] Stop ALTER SYSTEM from making bad assumptions
Isaac Morland writes: > Here's a radical suggestion: replace postgresql.auto.conf with a directory > containing multiple files. Each file is named after a configuration > parameter, and its content is the value of the parameter. Hmm ... that seems like a lot of work and code churn --- in particular, guaranteed breakage of code that works today --- to solve a problem we haven't got. The problem we do have is lack of documentation, which this wouldn't in itself remedy. > In order to prevent confusing and surprising behaviour, the system should > complain vociferously if it finds a configuration parameter file that is > not named after a defined parameter, rather than just ignoring it. As has been pointed out repeatedly, the set of known parameters just isn't that stable. Different builds can recognize different sets of GUCs, even without taking extensions into account. regards, tom lane
Re: POC: Cleaning up orphaned files using undo logs
On Tue, Jul 30, 2019 at 4:02 AM Andres Freund wrote: > I'm a bit worried about expanding the use of > ReadBufferWithoutRelcache(). Not so much because of the relcache itself, > but because it requires doing separate smgropen() calls. While not > crazily expensive, it's also not free. Especially combined with closing > all such relations at transaction end (c.f. AtEOXact_SMgr). > > I'm somewhat inclined to think that this requires a slightly bigger > refactoring than done in this patch. Imo at the very least the smgr > entries ought not to be unowned. But working towards not haven to > re-open the smgr entry for every single trival request ought to be part > of this too. I spent some time trying to analyze this today and I agree with you that there seems to be room for improvement here. When I first looked at your comments, I wasn't too convinced, because access patterns that skip around between undo logs seem like they may be fairly common. Admittedly, there are cases where we want to read from just one undo log over and over again, and it would be good to optimize those, but I was initially a bit unconvinced that that there was a problem here worth being concerned about. Then I realized that you would also repeat the smgropen() if you read a single record that happens to be split across two pages, which seems a little silly. But then I realized that we're being a little silly even in the case where we're reading a single undo record that is stored entirely on a single page. We are certainly going to need to look up the undo log, but as things stand, we'll basically do it twice. For example, in the write path, we'll call UndoLogAllocate() and it will look up an UndoLogControl object for the undo log of interest, and then we'll call ReadBufferWithoutRelcache() which will call smgropen() which will do a hash table lookup to find the SMgrRelation associated with that undo log. That's not a large cost, as you say, but it does seem like it might be better to avoid having two different lookups in the same commonly-used code path, each of which peeks into a different backend-private data structure for information about the very same undo log. The obvious thing to do seems to be to have UndoLogControl objects own SmgrRelations. That would be something of a novelty, since it looks like currently only a Relation ever owns an SMgrRelation, but the smgr infrastructure seems to have been set up in a generic way so as to permit that sort of thing, so it seems like it should be workable. Perhaps UndoLogAllocate() function could return a pointer to the UndoLogControl object as well as UndoRecPtr. Then, there could be a function UndoLogWrite(UndoLogControl *, UndoRecPtr, char *, Size). On the read side, instead of calling UndoRecPtrAssignRelFileNode, maybe the undo log storage layer should provide a function that again returns an UndoLogControl, and then we could have a matching function UndoLogRead(UndoLogControl *, UndoRecPtr, char *, Size). I think this kind of design would address your concerns about using the unowned list, too, since the UndoLogControl objects would be owning the SMgrRelations. It took me a while to understand why you were concerned about using the unowned list, so I'm going to repeat it in my own words to make sure I've got it right, and also to possibly help out anyone else who may also have had difficulty grokking your concern. 
If we have a bunch of short transactions each of which accesses the same relation, the relcache entry will remain open and the file won't get closed in between, but if we have a bunch of short transactions each of which accesses the same undo log, the undo log will be closed and reopened at the operating system level for each individual transaction. That happens because when an SMgrRelation is "owned," the owner takes care of closing it, and so can keep it open across transactions, but when it's "unowned," it's automatically closed during transaction cleanup. And we should fix it, because closing and reopening the same file for every transaction unnecessarily might be expensive enough to matter, at least a little bit. How does all that sound? -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
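A rough sketch of what that ownership could look like in code. The UndoLogControl fields and the undo_log_get_rnode() helper are made up for illustration (the real undo patch may arrange this differently); smgropen(), smgrsetowner() and InvalidBackendId are the existing smgr APIs.

#include "postgres.h"

#include "storage/backendid.h"
#include "storage/smgr.h"

/*
 * Sketch: cache an owned SMgrRelation inside each UndoLogControl so the
 * underlying undo files stay open across transactions instead of being
 * closed as "unowned" by AtEOXact_SMgr().
 */
static SMgrRelation
undo_log_smgr(UndoLogControl *log)	/* UndoLogControl is from the patch */
{
	if (log->smgr == NULL)		/* hypothetical cached field */
	{
		RelFileNode rnode = undo_log_get_rnode(log);	/* hypothetical helper */

		log->smgr = smgropen(rnode, InvalidBackendId);

		/*
		 * Register ownership: if anything else smgrclose()s the entry, our
		 * pointer is reset to NULL instead of dangling, and the entry stays
		 * off the unowned list that transaction cleanup would close.
		 */
		smgrsetowner(&log->smgr, log->smgr);
	}
	return log->smgr;
}

The proposed UndoLogRead()/UndoLogWrite() functions would then go through this cached handle instead of calling smgropen() once per request.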
Re: [PATCH] Stop ALTER SYSTEM from making bad assumptions
Re: Tomas Vondra 2019-08-03 <20190803124111.2aaddumd7url5wmq@development>
> If we really want to give external tools a sensible (and optional) API
> to access the file, a simple command-line tool seems much better. Say we
> have something like
>
> pg_config_file -f PATH --set KEY VALUE
> pg_config_file -f PATH --get KEY

Fwiw, Debian has pg_conftool (based on the perl lib around PgCommon.pm):

NAME
       pg_conftool - read and edit PostgreSQL cluster configuration files

SYNOPSIS
       pg_conftool [options] [version cluster] [configfile] command

DESCRIPTION
       pg_conftool allows to show and set parameters in PostgreSQL
       configuration files.

       If version cluster is omitted, it defaults to the default cluster
       (see user_clusters(5) and postgresqlrc(5)). If configfile is omitted,
       it defaults to postgresql.conf. configfile can also be a path, in
       which case version cluster is ignored.

OPTIONS
       -b, --boolean
              Format boolean value as on or off (not for "show all").

       -s, --short
              Show only the value (instead of key = value pair).

       -v, --verbose
              Verbose output.

       --help
              Print help.

COMMANDS
       show parameter|all
              Show a parameter, or all present in this config file.

       set parameter value
              Set or update a parameter.

       remove parameter
              Remove (comment out) a parameter from a config file.

       edit
              Open the config file in an editor. Unless $EDITOR is set, vi
              is used.

SEE ALSO
       user_clusters(5), postgresqlrc(5)

AUTHOR
       Christoph Berg

Debian                            2019-07-15                    PG_CONFTOOL(1)
Re: pgbench - implement strict TPC-B benchmark
Hello Andres, If not, do you think advisable to spend time improving the evaluator & variable stuff and possibly other places for an overall 15% gain? Also, what would be the likelyhood of such optimization patch to pass? I could do a limited variable management improvement patch, eventually, I have funny ideas to speedup the thing, some of which outlined above, some others even more terrible. Attached is a quick PoC. sh> cat set.sql \set x 0 \set y :x \set z :y \set w :z \set v :w \set x :v \set y :x \set z :y \set w :z \set v :w \set x :v \set y :x \set z :y \set w :z \set v :w \set x1 :x \set x2 :x \set y1 :z \set y0 :w \set helloworld :x Before the patch: sh> ./pgbench -T 10 -f vars.sql ... tps = 802966.183240 (excluding connections establishing) # 0.8M After the patch: sh> ./pgbench -T 10 -f vars.sql ... tps = 2665382.878271 (excluding connections establishing) # 2.6M Which is a (somehow disappointing) * 3.3 speedup. The impact on the 3 complex expressions tests is not measurable, though. Probably variable management should be reworked more deeply to cleanup the code. Questions: - how likely is such a patch to pass? (IMHO not likely) - what is its impact to overall performance when actual queries are performed (IMHO very small). -- Fabien.diff --git a/src/bin/pgbench/Makefile b/src/bin/pgbench/Makefile index 25abd0a875..39bba12c23 100644 --- a/src/bin/pgbench/Makefile +++ b/src/bin/pgbench/Makefile @@ -7,7 +7,7 @@ subdir = src/bin/pgbench top_builddir = ../../.. include $(top_builddir)/src/Makefile.global -OBJS = pgbench.o exprparse.o $(WIN32RES) +OBJS = pgbench.o exprparse.o symbol_table.o $(WIN32RES) override CPPFLAGS := -I. -I$(srcdir) -I$(libpq_srcdir) $(CPPFLAGS) LDFLAGS_INTERNAL += -L$(top_builddir)/src/fe_utils -lpgfeutils $(libpq_pgport) diff --git a/src/bin/pgbench/exprparse.y b/src/bin/pgbench/exprparse.y index 17c9ec32c5..e5a15ae136 100644 --- a/src/bin/pgbench/exprparse.y +++ b/src/bin/pgbench/exprparse.y @@ -212,6 +212,7 @@ make_variable(char *varname) expr->etype = ENODE_VARIABLE; expr->u.variable.varname = varname; + expr->u.variable.varnum = getOrCreateSymbolId(symbol_table, varname); return expr; } diff --git a/src/bin/pgbench/pgbench.c b/src/bin/pgbench/pgbench.c index 570cf3306a..14153ebc35 100644 --- a/src/bin/pgbench/pgbench.c +++ b/src/bin/pgbench/pgbench.c @@ -118,6 +118,7 @@ typedef int pthread_attr_t; static int pthread_create(pthread_t *thread, pthread_attr_t *attr, void *(*start_routine) (void *), void *arg); static int pthread_join(pthread_t th, void **thread_return); +// TODO: mutex? #elif defined(ENABLE_THREAD_SAFETY) /* Use platform-dependent pthread capability */ #include @@ -232,7 +233,9 @@ const char *progname; volatile bool timer_exceeded = false; /* flag from signal handler */ /* - * Variable definitions. + * Variable contents. + * + * Variable names are managed in symbol_table with a number. * * If a variable only has a string value, "svalue" is that value, and value is * "not set". 
If the value is known, "value" contains the value (in any @@ -243,11 +246,14 @@ volatile bool timer_exceeded = false; /* flag from signal handler */ */ typedef struct { - char *name; /* variable's name */ - char *svalue; /* its value in string form, if known */ - PgBenchValue value; /* actual variable's value */ + int number; /* variable symbol number */ + char *svalue; /* its value in string form, if known */ + PgBenchValue value; /* actual variable's value */ } Variable; +#define undefined_variable_value(v) \ + v.svalue == NULL && v.value.type == PGBT_NO_VALUE + #define MAX_SCRIPTS 128 /* max number of SQL scripts allowed */ #define SHELL_COMMAND_SIZE 256 /* maximum size allowed for shell command */ @@ -395,7 +401,6 @@ typedef struct /* client variables */ Variable *variables; /* array of variable definitions */ int nvariables; /* number of variables */ - bool vars_sorted; /* are variables sorted by name? */ /* various times about current transaction */ int64 txn_scheduled; /* scheduled start time of transaction (usec) */ @@ -493,6 +498,7 @@ static const char *QUERYMODE[] = {"simple", "extended", "prepared"}; * varprefix SQL commands terminated with \gset have this set *to a non NULL value. If nonempty, it's used to prefix the *variable name that receives the value. + * set_varnum variable symbol number for \set and \setshell. * expr Parsed expression, if needed. * stats Time spent in this command. */ @@ -505,6 +511,7 @@ typedef struct Command int argc; char *argv[MAX_ARGS]; char *varprefix; + int set_varnum; PgBenchExpr *expr; SimpleStats stats; } Command; @@ -523,6 +530,10 @@ static int64 total_weight = 0; static int debug = 0; /* debug flag */ +/* also used by exprparse.y */ +#define INITIAL_SYMBOL_TABLE_SIZE 128 +SymbolTable symbol_table = NULL; + /* Bu
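A standalone illustration of the idea behind the PoC above: resolve each variable name to a small integer once at parse time, so that execution can index an array instead of doing a per-use name lookup. The linear scan and memory handling are simplified for brevity and are not how the actual patch implements getOrCreateSymbolId().

#include <stdlib.h>
#include <string.h>

typedef struct SymbolTable
{
	char	  **names;			/* symbol number -> variable name */
	int			nsymbols;
	int			capacity;
} SymbolTable;

/*
 * Return the symbol number for a variable name, creating it if needed.
 * Called while parsing \set and :var references, so that later evaluation
 * can address variables by number.  Error handling omitted.
 */
static int
getOrCreateSymbolId(SymbolTable *st, const char *name)
{
	for (int i = 0; i < st->nsymbols; i++)
	{
		if (strcmp(st->names[i], name) == 0)
			return i;
	}

	if (st->nsymbols == st->capacity)
	{
		st->capacity = (st->capacity > 0) ? st->capacity * 2 : 16;
		st->names = realloc(st->names, st->capacity * sizeof(char *));
	}
	st->names[st->nsymbols] = strdup(name);
	return st->nsymbols++;
}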
Re: [PATCH] Stop ALTER SYSTEM from making bad assumptions
On Mon, Aug 05, 2019 at 03:52:07PM +0900, Ian Barwick wrote: > On 8/4/19 1:59 AM, Tom Lane wrote:> Tomas Vondra > writes: > >> On Fri, Aug 02, 2019 at 06:08:02PM -0700, Andres Freund wrote: > >>> We're WAY WAY past feature freeze. This isn't the time to rewrite guc.c, > >>> guc-file.l to be suitable for running outside of a backend environment. > > > >> Right. And even if we had the code, it's not quite backpatchable (which > >> we probably should do, considering this is a general ALTER SYSTEM issue, > >> so not pg12-only). > > > >> Not to mention there's no clear consensus this is actually desirable. > >> I'd argue forcing external tools (written in arbitrary language) to use > >> this library (written in C), just to modify a "stupid" text file is a > >> bit overkill. IMO duplicates don't make the file invalid, we should > >> handle that correctly/gracefully, so I don't see why external tools > >> could not simply append to the file. We can deduplicate the file when > >> starting the server, on ALTER SYSTEM, or some other time. > > > > Yup. I'd also point out that even if we had a command-line tool of this > > sort, there would be scenarios where it's not practical or not convenient > > to use. We need not go further than "my tool needs to work with existing > > PG releases" to think of good examples. > > I suspect this hasn't been an issue before, simply because until the removal > of recovery.conf AFAIK there hasn't been a general use-case where you'd need > to modify pg.auto.conf while the server is not running. The use-case which now > exists (i.e. for writing replication configuration) is one where the tool will > need to be version-aware anyway (like pg_basebackup is), so I don't see that > as > a particular deal-breaker. > > But... > > > I think we should just accept the facts on the ground, which are that > > some tools modify pg.auto.conf by appending to it > > +1. It's just a text file... So are crontab and /etc/passwd, but there are gizmos that help make it difficult for people to write complete gobbledygook to those. Does it make sense to discuss tooling of that type? Best, David. -- David Fetter http://fetter.org/ Phone: +1 415 235 3778 Remember to vote! Consider donating to Postgres: http://www.postgresql.org/about/donate
Re: pgbench - implement strict TPC-B benchmark
Hi, On 2019-08-05 17:38:23 +0200, Fabien COELHO wrote: > Which is a (somehow disappointing) * 3.3 speedup. The impact on the 3 > complex expressions tests is not measurable, though. I don't know why that could be disappointing. We put in much more work for much smaller gains in other places. > Probably variable management should be reworked more deeply to cleanup the > code. Agreed. > Questions: > - how likely is such a patch to pass? (IMHO not likely) I don't see why? I didn't review the patch in any detail, but it didn't look crazy in quick skim? Increasing how much load can be simulated using pgbench, is something I personally find much more interesting than adding capabilities that very few people will ever use. FWIW, the areas I find current pgbench "most lacking" during development work are: 1) Data load speed. The data creation is bottlenecked on fprintf in a single process. The index builds are done serially. The vacuum could be replaced by COPY FREEZE. For a lot of meaningful tests one needs 10-1000s of GB of testdata - creating that is pretty painful. 2) Lack of proper initialization integration for custom scripts. I.e. have steps that are in the custom script that allow -i, vacuum, etc to be part of the script, rather than separately executable steps. --init-steps doesn't do anything for that. 3) pgbench overhead, although that's to a significant degree libpq's fault 4) Ability to cancel pgbench and get approximate results. That currently works if the server kicks out the clients, but not when interrupting pgbench - which is just plain weird. Obviously that doesn't matter for "proper" benchmark runs, but often during development, it's enough to run pgbench past some events (say the next checkpoint). > - what is its impact to overall performance when actual queries >are performed (IMHO very small). Obviously not huge - I'd also not expect it to be unobservably small either. Especially if somebody went and fixed some of the inefficiency in libpq, but even without that. And even moreso, if somebody revived the libpq batch work + the relevant pgbench patch, because that removes a lot of the system/kernel overhead, due to the crazy number of context switches (obviously not realistic for all workloads, but e.g. for plenty java workloads, it is), but leaves the same number of variable accesses etc. Greetings, Andres Freund
Re: [PATCH] Stop ALTER SYSTEM from making bad assumptions
David Fetter writes: > On Mon, Aug 05, 2019 at 03:52:07PM +0900, Ian Barwick wrote: >> On 8/4/19 1:59 AM, Tom Lane wrote: >>> I think we should just accept the facts on the ground, which are that >>> some tools modify pg.auto.conf by appending to it >> +1. It's just a text file... > So are crontab and /etc/passwd, but there are gizmos that help make it > difficult for people to write complete gobbledygook to those. Does it > make sense to discuss tooling of that type? Perhaps as a future improvement, but it's not happening for v12, at least not unless you accept "use ALTER SYSTEM in a standalone backend" as a usable answer. regards, tom lane
Re: Adding column "mem_usage" to view pg_prepared_statements
On 31.07.2019 1:38, Daniel Migowski wrote: On 31.07.2019 at 00:29, Tom Lane wrote: Daniel Migowski writes: Ok, just have read about the commitfest thing. Is the patch OK for that? Or is there generally no love for a mem_usage column here? If it was, I really would add some memory monitoring in our app regarding this. You should certainly put it into the next commitfest. We might or might not accept it, but if it's not listed in the CF we might not remember to even review it. (The CF app is really a to-do list for patches ...) OK, added it there. Thanks for your patience :). Regards, Daniel Migowski The patch does not apply to the most recent sources because an extra parameter was added to the CreateTemplateTupleDesc method. Please rebase it - the fix is trivial. I think that including information in pg_prepared_statements about the memory used by each statement is very useful. The CachedPlanMemoryUsage function may be useful not only for this view; for example, it is also needed in my autoprepare patch. I wonder if you would consider going further and not only reporting but also controlling the memory used by prepared statements? For example, implementing some LRU replacement discipline on top of the prepared statement cache which can evict rarely used prepared statements to avoid memory overflow. We have such a patch for PgPro-EE, but it limits only the number of prepared statements, not taking into account the amount of memory used by them. I think that a memory-based limit would be more accurate (although it adds more overhead). If you want, I can be a reviewer of your patch. -- Konstantin Knizhnik Postgres Professional: http://www.postgrespro.com The Russian Postgres Company
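A toy model of the memory-bounded LRU being suggested here. The entry layout and the drop callback are placeholders; a real patch would hang this off the backend's prepared-statement hash table and feed the memory accounting from something like the CachedPlanMemoryUsage() figure discussed in this thread.

#include <stddef.h>

typedef struct CacheEntry
{
	struct CacheEntry *prev;
	struct CacheEntry *next;
	size_t		mem_usage;			/* estimated memory of this statement */
	void	   *stmt;				/* placeholder for the cached statement */
} CacheEntry;

typedef struct LruCache
{
	CacheEntry *head;				/* most recently used */
	CacheEntry *tail;				/* least recently used */
	size_t		total_mem;
	size_t		mem_limit;
	void		(*drop) (void *stmt);	/* callback that frees one statement */
} LruCache;

/* Detach an entry from the list; total_mem bookkeeping is the caller's job. */
static void
lru_unlink(LruCache *c, CacheEntry *e)
{
	if (e->prev)
		e->prev->next = e->next;
	else
		c->head = e->next;
	if (e->next)
		e->next->prev = e->prev;
	else
		c->tail = e->prev;
	e->prev = e->next = NULL;
}

/* Mark an entry as just used: move it to the head of the list. */
static void
lru_touch(LruCache *c, CacheEntry *e)
{
	lru_unlink(c, e);
	e->next = c->head;
	if (c->head)
		c->head->prev = e;
	c->head = e;
	if (c->tail == NULL)
		c->tail = e;
}

/* Evict least-recently-used statements until we are back under the limit. */
static void
lru_enforce_limit(LruCache *c)
{
	while (c->total_mem > c->mem_limit && c->tail != NULL)
	{
		CacheEntry *victim = c->tail;

		lru_unlink(c, victim);
		c->total_mem -= victim->mem_usage;
		c->drop(victim->stmt);
	}
}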
Re: [PATCH] Stop ALTER SYSTEM from making bad assumptions
On Mon, Aug 5, 2019 at 12:25 PM Tom Lane wrote: > Perhaps as a future improvement, but it's not happening for v12, > at least not unless you accept "use ALTER SYSTEM in a standalone > backend" as a usable answer. I'm not sure that would even work for the cases at issue ... because we're talking about setting up recovery parameters, and wouldn't the server run recovery before it got around to do anything with ALTER SYSTEM? -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: Remove HeapTuple and Buffer dependency for predicate locking functions
Hi, On 2019-08-05 20:58:05 +1200, Thomas Munro wrote: > On Sat, Aug 3, 2019 at 11:56 AM Ashwin Agrawal wrote: > > On Wed, Jul 31, 2019 at 2:06 PM Andres Freund wrote: > >> I'm also a bit confused why we don't need to pass in the offset of the > >> current tuple, rather than the HOT root tuple here. That's not related > >> to this patch. But aren't we locking the wrong tuple here, in case of > >> HOT? > > > > Yes, root is being locked here instead of the HOT. But I don't have full > > context on the same. If we wish to fix it though, can be easily done now > > with the patch by passing "tid" instead of &(heapTuple->t_self). > > Here are three relevant commits: Thanks for digging! > 1. Commit dafaa3efb75 "Implement genuine serializable isolation > level." (2011) locked the root tuple, because it set t_self to *tid. > Possibly due to confusion about the effect of the preceding line > ItemPointerSetOffsetNumber(tid, offnum). > > 2. Commit commit 81fbbfe3352 "Fix bugs in SSI tuple locking." (2013) > fixed that by adding ItemPointerSetOffsetNumber(&heapTuple->t_self, > offnum). Hm. It's not at all sure that it's ok to report the non-root tuple tid here. I.e. I'm fairly sure there was a reason to not just set it to the actual tid. I think I might have written that up on the list at some point. Let me dig in memory and list. Obviously possible that that was also obsoleted by parallel changes. > 3. Commit b89e151054a "Introduce logical decoding." (2014) also did > ItemPointerSet(&(heapTuple->t_self), BufferGetBlockNumber(buffer), > offnum), for the benefit of historical MVCC snapshots (unnecessarily, > considering the change in the commit #2), but then, intending to > "reset to original, non-redirected, tid", clobbered it, reintroducing > the bug fixed by #2. > My first guess is that commit #3 above was developed before commit #2, > and finished up clobbering it. Yea, that sounds likely. > This must be in want of an isolation test, but I haven't yet tried to > get my head around how to write a test that would show the difference. Indeed. Greetings, Andres Freund
Re: POC: Cleaning up orphaned files using undo logs
Hi, On 2019-08-05 11:25:10 -0400, Robert Haas wrote: > The obvious thing to do seems to be to have UndoLogControl objects own > SmgrRelations. That would be something of a novelty, since it looks > like currently only a Relation ever owns an SMgrRelation, but the smgr > infrastructure seems to have been set up in a generic way so as to > permit that sort of thing, so it seems like it should be workable. Yea, I think that'd be a good step. I'm not 100% convinced it's quite enough, due to the way the undo smgr only ever has a single file descriptor open, and that undo log segments are fairly small, and that there'll often be multiple persistence levels active at the same time. But the undo fd handling is probably a separate concern than from who owns the smgr relations. > I think this kind of design would address your concerns about using > the unowned list, too, since the UndoLogControl objects would be > owning the SMgrRelations. Yup. > How does all that sound? A good move in the right direction, imo. Greetings, Andres Freund
Re: [PATCH] Stop ALTER SYSTEM from making bad assumptions
Robert Haas writes: > On Mon, Aug 5, 2019 at 12:25 PM Tom Lane wrote: >> Perhaps as a future improvement, but it's not happening for v12, >> at least not unless you accept "use ALTER SYSTEM in a standalone >> backend" as a usable answer. > I'm not sure that would even work for the cases at issue ... because > we're talking about setting up recovery parameters, and wouldn't the > server run recovery before it got around to do anything with ALTER > SYSTEM? Yeah, good point. There are a lot of other cases where you really don't want system startup to happen, too. On the other hand, people have also opined that they want full error checking on the inserted values, and that seems out of reach with less than a full running system (mumble extensions mumble). In the end, I think I don't buy Stephen's argument that there should be a one-size-fits-all answer. It seems entirely reasonable that we'll have more than one way to do it, because the constraints are different depending on what use-case you think about. regards, tom lane
Re: [PATCH] Stop ALTER SYSTEM from making bad assumptions
Hi, On 2019-08-05 12:43:24 -0400, Tom Lane wrote: > Yeah, good point. There are a lot of other cases where you really > don't want system startup to happen, too. Agreed. > On the other hand, people have also opined that they want full error > checking on the inserted values, and that seems out of reach with less > than a full running system (mumble extensions mumble). I think the error checking ought to be about as complete as the one we do during a normal postmaster startup. Afaict that requires loading shared_preload_library extensions, but does *not* require booting up far enough to run GUC checks in a context with database access. The one possible "extension" to that that I can see is that arguably we might want to error out if DefineCustom*Variable() doesn't think the value is valid for a shared_preload_library, rather than just WARNING (i.e. refuse to start). We can't really do that for other libraries, but for shared_preload_libraries it might make sense. Although I suspect the better approach would be to just generally convert that to an error, rather than having some startup specific logic. Greetings, Andres Freund
Re: [HACKERS] Cached plans and statement generalization
On 01.08.2019 19:56, Konstantin Knizhnik wrote: On 31.07.2019 19:56, Heikki Linnakangas wrote: On 09/07/2019 23:59, Konstantin Knizhnik wrote: Fixed patch version of the path is attached. Much of the patch and the discussion has been around the raw parsing, and guessing which literals are actually parameters that have been inlined into the SQL text. Do we really need to do that, though? The documentation mentions pgbouncer and other connection poolers, where you don't have session state, as a use case for this. But such connection poolers could and should still be using the extended query protocol, with Parse+Bind messages and query parameters, even if they don't use named prepared statements. I'd want to encourage applications and middleware to use out-of-band query parameters, to avoid SQL injection attacks, regardless of whether they use prepared statements or cache query plans. So how about dropping all the raw parse tree stuff, and doing the automatic caching of plans based on the SQL string, somewhere in the exec_parse_message? Check the autoprepare cache in exec_parse_message(), if it was an "unnamed" prepared statement, i.e. if the prepared statement name is an empty string. (I'm actually not sure if exec_parse/bind_message is the right place for this, but I saw that your current patch did it in exec_simple_query, and exec_parse/bind_message are the equivalent of that for the extended query protocol). - Heikki I decided to implement your proposal. Much simple version of autoprepare patch is attached. At my computer I got the following results: pgbench -M simple -S 22495 TPS pgbench -M extended -S 47633 TPS pgbench -M prepared -S 47683 TPS So autoprepare speedup execution of queries sent using extended protocol more than twice and it is almost the same as with explicitly prepared statements. I failed to create regression test for it because I do not know how to force psql to use extended protocol. Any advice is welcome. Slightly improved and rebased version of the patch. -- Konstantin Knizhnik Postgres Professional: http://www.postgrespro.com The Russian Postgres Company diff --git a/doc/src/sgml/autoprepare.sgml b/doc/src/sgml/autoprepare.sgml new file mode 100644 index 000..c4d96e6 --- /dev/null +++ b/doc/src/sgml/autoprepare.sgml @@ -0,0 +1,62 @@ + + + + Autoprepared statements + + + autoprepared statements + + + +PostgreSQL makes it possible prepare +frequently used statements to eliminate cost of their compilation +and optimization on each execution of the query. On simple queries +(like ones in pgbench -S) using prepared statements +increase performance more than two times. + + + +Unfortunately not all database applications are using prepared statements +and, moreover, it is not always possible. For example, in case of using +pgbouncer or any other session pooler, +there is no session state (transactions of one client may be executed at different +backends) and so prepared statements can not be used. + + + +Autoprepare mode allows to overcome this limitation. +In this mode Postgres stores plans for all queries pass using extended protocol. +Speed of execution of autoprepared statements is almost the same as of explicitly prepared statements. + + + +By default autoprepare mode is switched off. To enable it, assign non-zero +value to GUC variable autoprepare_limit or autoprepare_memory_limit. +This variables specify limit for number of autoprepared statement or memory used by them. +Autoprepare is enabled if one of thme is non zero. Value -1 means unlimited. 
+Please notice that event autoprepare is anabled, Postgres makes a decision about using +generalized plan vs. customized execution plans based on the results +of comparison of average time of five customized plans with +time of generalized plan. + + + +Too large number of autoprepared statements can cause memory overflow +(especially if there are many active clients, because prepared statements cache +is local to the backend). So using unlimited autoprepare hash is debgerous and not recommended. +Postgres is using LRU policy to keep in memory most frequently used queries. + + + +Autoprepare hash is local to the backend. It is implicitely reseted on any change of database schema or +session variables. + + + +It is possible to inspect autoprepared queries in the backend using +pg_autoprepared_statements view. It shows original text of the +query, types of the extracted parameters (replacing literals), +query execution counter and used memory. + + + diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml index cdc30fa..28c9343 100644 --- a/doc/src/sgml/config.sgml +++ b/doc/src/sgml/config.sgml @@ -5326,6 +5326,39 @@ SELECT * FROM parent WHERE key = 2400; + + autoprepare_limit (int
Re: [PATCH] Stop ALTER SYSTEM from making bad assumptions
On Mon, Aug 5, 2019 at 10:21 AM Stephen Frost wrote: > Just to be clear, I brought up this exact concern back in *November*: > > https://www.postgresql.org/message-id/20181127153405.GX3415%40tamriel.snowman.net > > And now because this was pushed forward and the concerns that I raised > ignored, we're having to deal with this towards the end of the release > cycle instead of during normal development. I disagree. My analysis is that you're blocking a straightforward bug fix because we're not prepared to redesign the world to match your expectations. The actual point of controversy at the moment, as I understand it, is this: if the backend, while rewriting postgresql.auto.conf, discovers that it contains duplicates, should we (a) give a WARNING or (b) not? The argument for not doing that is pretty simple: if we give a WARNING when this happens, then every tool that appends to postgresql.auto.conf has to be responsible for making sure to remove duplicates along the way. To do that reliably, it needs a client-accessible version of all the GUC parsing stuff. You refer to this above as an "assumption," but it seems to me that a more accurate word would be "fact." Now, I don't think anyone would disagree with the idea that it possible to do it in an only-approximately-correct way pretty easily: just match the first word of the line against the GUC you're proposing to set, and drop the line if it matches. If you want it to be exactly correct, though, you either need to run the original code, or your own custom code that behaves in exactly the same way. And since the original code runs only in the server, it follows directly that if you are not running inside the server, you cannot be running the original code. How you can label any of that as an "assumption" is beyond me. Now, I agree that IF we were prepared to provide a standalone config-editing tool that removes duplicates, THEN it would not be crazy to emit a WARNING if we find a duplicate, because we could reasonably tell people to just use that tool. However, such a tool is not trivial to construct, as evidenced by the fact that, on this very thread, Ian tried and Andres thought the result contained too much code duplication. Moreover, we are past feature freeze, which is the wrong time to add altogether new things to the release, even if we had code that everybody liked. Furthermore, even if we had such a tool and even if it had already been committed, I would still not be in favor of the WARNING, because duplicate settings in postgresql.auto.conf are harmless unless you include a truly insane number of them, and there is no reason for anybody to ever do that. In a way, I sorta hope somebody does do that, because if I get a problem report from a user that they put 10 million copies of their recovery settings in postgresql.auto.conf and the server now starts up very slowly, I am going to have a good private laugh, and then suggest that they maybe not do that. In general, I am sympathetic to the argument that we ought to do tight integrity-checking on inputs: that's one of the things for which PostgreSQL is known, and it's a strength of the project. In this case, though, the cost-benefit trade-off seems very poor to me: it just makes life complicated without really buying us anything. 
The whole reason postgresql.conf is a text file in the first place instead of being stored in the catalogs is because you might not be able to start the server if it's not set right, and if you can't edit it without being able to start the server, then you're stuck. Indeed, one of the key arguments in getting ALTER SYSTEM accepted in the first place was that, if you put dumb settings into postgresql.auto.conf and borked your system so it wouldn't start, you could always use a text editor to undo it. Given that history, any argument that postgresql.auto.conf is somehow different and should be subjected to tighter integrity constraints does not resonate with me. Its mission is to be a machine-editable postgresql.conf, not to be some other kind of file that plays by a different set of rules. I really don't understand why you're fighting so hard about this. We have a long history of being skeptical about WARNING messages. If, on the one hand, they are super-important, they might still get ignored because it could be an automated context where nobody will see it; and if, on the other hand, they are not that important, then emitting them is just clutter in the first place. The particular WARNING under discussion here is one that would likely only fire long after the fact, when it's far too late to do anything about it, and when, in all probability, no real harm has resulted anyway. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: Adding column "mem_usage" to view pg_prepared_statements
Hi, On 2019-07-28 06:20:40 +, Daniel Migowski wrote: > how do you want to generalize it? Are you thinking about a view solely > for the display of the memory usage of different objects? I'm not quite sure. I'm just not sure that adding separate infrastructure for various objects is a sutainable approach. We'd likely want to have this for prepared statements, for cursors, for the current statement, for various caches, ... I think an approach would be to add an 'owning_object' field to memory contexts, which has to point to a Node type if set. A table returning reporting function could recursively walk through the memory contexts, starting at TopMemoryContext. Whenever it encounters a context with owning_object set, it prints a string version of nodeTag(owning_object). For some node types it knows about (e.g. PreparedStatement, Portal, perhaps some of the caches), it prints additional metadata specific to the type (so for prepared statement it'd be something like 'prepared statement', '[name of prepared statement]'), and prints information about the associated context and all its children. The general context information probably should be something like: context_name, context_ident, context_total_bytes, context_total_blocks, context_total_freespace, context_total_freechunks, context_total_used, context_total_children context_self_bytes, context_self_blocks, context_self_freespace, context_self_freechunks, context_self_used, context_self_children, It might make sense to have said function return a row for the contexts it encounters that do not have an owner set too (that way we'd e.g. get CacheMemoryContext handled), but then still recurse. Arguably the proposed owning_object field would be a bit redundant with the already existing ident/MemoryContextSetIdentifier field, which e.g. already associates the query string with the contexts used for a prepared statement. But I'm not convinced that's going to be enough context in a lot of cases, because e.g. for prepared statements it could be interesting to have access to both the prepared statement name, and the statement. The reason I like something like this is that we wouldn't add new columns to a number of views, and lack views to associate such information to for some objects. And it'd be disproportional to add all the information to numerous places anyway. One counter-argument is that it'd be more expensive to get information specific to prepared statements (or other object types) that way. I'm not sure I buy that that's a problem - this isn't something that's likely going to be used at a high frequency. But if it becomes a problem, we can add a function that starts that process at a distinct memory context (e.g. a function that does this just for a single prepared statement, identified by name) - but I'd not start there. > While being interesting I still believe monitoring the mem usage of > prepared statements is a bit more important than that of other objects > because of how they change memory consumption of the server without > using any DDL or configuration options and I am not aware of other > objects with the same properties, or are there some? And for the other > volatile objects like tables and indexes and their contents PostgreSQL > already has it's information functions. Plenty other objects have that property. E.g. cursors. And for the catalog/relation/... caches it's even more pernicious - the client might have closed all its "handles", but we still use memory (and it's absolutely crucial for performance). 
Greetings, Andres Freund
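A sketch of the recursive walk described above, against the existing MemoryContext tree. The owning_object field does not exist today (it is the proposal), and report_row()/context_self_bytes() are stand-ins for the tuplestore output and per-context accounting a real reporting function would need.

#include "postgres.h"

#include "nodes/nodes.h"
#include "utils/memutils.h"

/* Stand-ins for the output and accounting pieces (not real functions). */
static void report_row(int depth, NodeTag owner_tag, const char *name,
					   const char *ident, Size self_bytes);
static Size context_self_bytes(MemoryContext context);

/*
 * Walk the context tree depth-first.  Contexts whose (proposed)
 * owning_object field is set get a row tagged with the owner's node type;
 * a real implementation might also emit rows for unowned contexts such as
 * CacheMemoryContext, as suggested above, and recurse into children either
 * way.
 */
static void
walk_contexts(MemoryContext context, int depth)
{
	MemoryContext child;

	if (context->owning_object != NULL)		/* proposed new field */
		report_row(depth,
				   nodeTag(context->owning_object),
				   context->name,
				   context->ident,
				   context_self_bytes(context));

	for (child = context->firstchild; child != NULL; child = child->nextchild)
		walk_contexts(child, depth + 1);
}

/* The reporting SRF would start at the top: walk_contexts(TopMemoryContext, 0); */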
Re: [PATCH] Stop ALTER SYSTEM from making bad assumptions
Greetings, * Tom Lane (t...@sss.pgh.pa.us) wrote: > Robert Haas writes: > > On Mon, Aug 5, 2019 at 12:25 PM Tom Lane wrote: > >> Perhaps as a future improvement, but it's not happening for v12, > >> at least not unless you accept "use ALTER SYSTEM in a standalone > >> backend" as a usable answer. > > > I'm not sure that would even work for the cases at issue ... because > > we're talking about setting up recovery parameters, and wouldn't the > > server run recovery before it got around to do anything with ALTER > > SYSTEM? > > Yeah, good point. There are a lot of other cases where you really > don't want system startup to happen, too. On the other hand, > people have also opined that they want full error checking on > the inserted values, and that seems out of reach with less than > a full running system (mumble extensions mumble). There have been ideas brought up about some way to provide "full validation" but I, at least, don't recall seeing anyone actually say that they *want* that- just different people suggesting that it could be done. I agree that full validation is a pipe dream for this kind of system and isn't what I was intending to suggest at any point. > In the end, I think I don't buy Stephen's argument that there should > be a one-size-fits-all answer. It seems entirely reasonable that > we'll have more than one way to do it, because the constraints are > different depending on what use-case you think about. This doesn't seem to me, at least, to be an accurate representation of my thoughts on this- there could be 15 different ways to modify the file, but let's have a common set of code to provide those ways instead of different code between the backend ALTER SYSTEM and the frontend pg_basebackup (and if we put it in the common library that we already have for sharing code between the backend and the frontend, and which we make available for external tools, then those external tools could use those methods in the same way that we do). I'm happy to be told I'm wrong, but as far as I know, there's nothing in appending to the file or removing duplicates that actually requires validation of the values which are included in order to apply those operations correctly. I'm sure I'll be told again about how we can't do this for 12, and I do appreciate that, but it's because we ignored these issues during development that we're here and that's really just not something that's acceptable in my view- we shouldn't be pushing features which have known issues that we then have to fight about how to fix it at the last minute and with the constraint that we can't make any big changes. Thanks, Stephen signature.asc Description: PGP signature
Re: [PATCH] Stop ALTER SYSTEM from making bad assumptions
Greetings, * Robert Haas (robertmh...@gmail.com) wrote: > On Mon, Aug 5, 2019 at 10:21 AM Stephen Frost wrote: > > Just to be clear, I brought up this exact concern back in *November*: > > > > https://www.postgresql.org/message-id/20181127153405.GX3415%40tamriel.snowman.net > > > > And now because this was pushed forward and the concerns that I raised > > ignored, we're having to deal with this towards the end of the release > > cycle instead of during normal development. > > I disagree. It's unclear what you're disagreeing with here as the below response doesn't seem to discuss the question about if these issues were brought up and pointed out previously, nor about if I was the one who raised them, nor about if we're towards the end of the release cycle. > My analysis is that you're blocking a straightforward bug > fix because we're not prepared to redesign the world to match your > expectations. The actual point of controversy at the moment, as I > understand it, is this: if the backend, while rewriting > postgresql.auto.conf, discovers that it contains duplicates, should we > (a) give a WARNING or (b) not? No, that isn't the point of the controversy nor does it relate, at all, to what you quoted above, so I don't think there's much value in responding to the discussion about WARNING or not that you put together below. Thanks, Stephen signature.asc Description: PGP signature
Re: Feature improvement: can we add queryId for pg_catalog.pg_stat_activity view?
Kyotaro Horiguchi-4 wrote > At Sun, 4 Aug 2019 00:04:01 -0700 (MST), legrand legrand < > legrand_legrand@ > > wrote in < > 1564902241482-0.post@.nabble >> >> > However having the nested queryid in >> > pg_stat_activity would be convenient to track >> > what is a long stored functions currently doing. >> >> +1 >> >> And this could permit to get wait event sampling per queryid when >> pg_stat_statements.track = all > > I'm strongly on this side emotionally, but also I'm on Tom and > Andres's side that exposing querid that way is not the right > thing. > > Doing that means we don't need exact correspondence between > top-level query and queryId (in nested or multistatement queries) > in this patch. pg_stat_statements will allow us to do the same > thing by having additional uint64[MaxBackends] array in > pgssSharedState, instead of expanding PgBackendStatus array in > core by the same size. > > regards. > > -- > Kyotaro Horiguchi > NTT Open Source Software Center Hi Kyotaro, Thank you for this answer. What you propose here is already available inside the pg_stat_sql_plans extension (a derivative of pg_stat_statements and pg_store_plans), and I'm used to this queryid behavior with top-level queries... My emotion was high, but I will accept it! Regards PAscal -- Sent from: https://www.postgresql-archive.org/PostgreSQL-hackers-f1928748.html
Re: [PATCH] Stop ALTER SYSTEM from making bad assumptions
Greetings, * Andres Freund (and...@anarazel.de) wrote: > On 2019-08-05 12:43:24 -0400, Tom Lane wrote: > > On the other hand, people have also opined that they want full error > > checking on the inserted values, and that seems out of reach with less > > than a full running system (mumble extensions mumble). > > I think the error checking ought to be about as complete as the one we > do during a normal postmaster startup. Afaict that requires loading > shared_preload_library extensions, but does *not* require booting up far > enough to run GUC checks in a context with database access. I'm not following this thread of the discussion. You're not suggesting that pg_basebackup perform this error checking after it modifies the file, are you..? Where are you thinking this error checking would be happening? Thanks, Stephen signature.asc Description: PGP signature
Re: Unused header file inclusion
On 2019-Aug-04, vignesh C wrote: > Made the fixes based on your comments, updated patch has the changes > for the same. Well, you fixed the two things that were quoted as examples of problems, but you didn't fix other occurrences of the same issues elsewhere. For example, you remove lwlock.h from dsa.c but there are structs there that have LWLocks as members. That's just the first hunk of the patch; I didn't look at the others, but it wouldn't surprise me if they have similar issues. I suggest this patch should be rejected. Then there's the <limits.h> removal in tuplesort.c, where INT_MAX, as added by commit d26559dbf356, is still present ... FWIW sharedtuplestore.c, a very young file, also includes <limits.h> but that appears to be copy-paste of includes from some other file (and also in an inappropriate place), so I have no objections to obliterating that one. But other than that one line, this patch needs more "adult supervision". -- Álvaro Herrera https://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Re: [PATCH] Stop ALTER SYSTEM from making bad assumptions
Hi, On 2019-08-05 13:34:39 -0400, Stephen Frost wrote: > * Andres Freund (and...@anarazel.de) wrote: > > On 2019-08-05 12:43:24 -0400, Tom Lane wrote: > > > On the other hand, people have also opined that they want full error > > > checking on the inserted values, and that seems out of reach with less > > > than a full running system (mumble extensions mumble). > > > > I think the error checking ought to be about as complete as the one we > > do during a normal postmaster startup. Afaict that requires loading > > shared_preload_library extensions, but does *not* require booting up far > > enough to run GUC checks in a context with database access. > > I'm not following this thread of the discussion. It's about the future, not v12. > Where are you thinking this error checking would be happening? A hypothethical post v12 tool that can set config values with as much checking as feasible. The IMO most realistic tool to do so is postmaster itself, similar to it's already existing -C. Boot it up until shared_preload_libraries have been processed, run check hook(s) for the new value(s), change postgresql.auto.conf, shutdown. > You're not suggesting that pg_basebackup perform this error checking > after it modifies the file, are you..? Not at the moment, at least. Greetings, Andres Freund
Re: [PATCH] Stop ALTER SYSTEM from making bad assumptions
On Mon, Aug 5, 2019 at 1:29 PM Stephen Frost wrote: > No, that isn't the point of the controversy nor does it relate, at all, > to what you quoted above, so I don't think there's much value in > responding to the discussion about WARNING or not that you put together > below. Well, if that's not what we're arguing about, then what the heck are we arguing about? All we need to do to resolve this issue is have Tom commit his patch. The set of people who are objecting to that is either {} or {Stephen Frost}. Even if it's the latter, we should just proceed, because there are clearly enough votes in favor of the patch to proceed, including 2 from RMT members, and if it's the former, we should DEFINITELY proceed. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: [PATCH] Stop ALTER SYSTEM from making bad assumptions
Greetings, On Mon, Aug 5, 2019 at 14:11 Andres Freund wrote: > On 2019-08-05 13:34:39 -0400, Stephen Frost wrote: > > * Andres Freund (and...@anarazel.de) wrote: > > > On 2019-08-05 12:43:24 -0400, Tom Lane wrote: > > > > On the other hand, people have also opined that they want full error > > > > checking on the inserted values, and that seems out of reach with > less > > > > than a full running system (mumble extensions mumble). > > > > > > I think the error checking ought to be about as complete as the one we > > > do during a normal postmaster startup. Afaict that requires loading > > > shared_preload_library extensions, but does *not* require booting up > far > > > enough to run GUC checks in a context with database access. > > > > I'm not following this thread of the discussion. > > It's about the future, not v12. I’m happy to chat about post-v12, certainly. As I said up thread, I get that we are in this unfortunate situation for v12 and we can do what needs doing here (where I agree with what Tom said, “a doc patch”- and with fixes for ALTER SYSTEM to be in line with that doc patch, along with pg_basebackup, should any changes be needed, of course). > Where are you thinking this error checking would be happening? > > A hypothethical post v12 tool that can set config values with as much > checking as feasible. The IMO most realistic tool to do so is postmaster > itself, similar to it's already existing -C. Boot it up until > shared_preload_libraries have been processed, run check hook(s) for the > new value(s), change postgresql.auto.conf, shutdown. That’s a nice idea but I don’t think it’s really necessary and I’m not sure how useful this level of error checking would end up being as part of pg_basebackup. I can certainly see value in a tool that could be run to verify a postgresql.conf+auto.conf is valid to the extent that we are able to do so, since that could, ideally, be run by the init script system prior to a restart to let the user know that their restart is likely to fail. Having that be some option to the postmaster could work, as long as it is assured to not do anything that would upset a running PG instance (like, say, try to allocate shared buffers). > You're not suggesting that pg_basebackup perform this error checking > > after it modifies the file, are you..? > > Not at the moment, at least. Since pg_basebackup runs, typically, on a server other than the one that PG is running on, it certainly would have to have a way to at least disable anything that caused it to try and load libraries on the destination side, or do anything else that required something external in order to validate- but frankly I don’t think it should ever be loading libraries that it has no business with, not even if it means that the error checking of the postgresql.conf would be wonderful. Thanks, Stephen >
Re: Unused header file inclusion
Alvaro Herrera writes: > Then there's the removal, which is in tuplesort.c because of > INT_MAX as added by commit d26559dbf356 and still present ... One has to be especially wary of removing system-header inclusions; the fact that they don't seem to be needed on your own machine doesn't prove they aren't needed elsewhere. > ... I suggest this patch should be rejected. Yeah. If we do anything along this line it should be based on pgrminclude results, and even then I think it'd require manual review, especially for changes in header files. The big picture here is that removing #includes is seldom worth the effort it takes :-( regards, tom lane
Re: [PATCH] Stop ALTER SYSTEM from making bad assumptions
Robert Haas writes: > All we need to do to resolve this issue is have Tom commit his patch. I think Stephen is not being unreasonable to suggest that we need some documentation about what external tools may safely do to pg.auto.conf. So somebody's got to write that. But I agree that we could go ahead with the code patch. (At this point I won't risk doing so before the wrap, though.) regards, tom lane
Re: POC: Cleaning up orphaned files using undo logs
Hi, (as I was out of context due to dealing with bugs, I've switched to lookign at the current zheap/undoprocessing branch. On 2019-07-30 01:02:20 -0700, Andres Freund wrote: > +/* > + * Insert a previously-prepared undo records. > + * > + * This function will write the actual undo record into the buffers which are > + * already pinned and locked in PreparedUndoInsert, and mark them dirty. > This > + * step should be performed inside a critical section. > + */ Again, I think it's not ok to just assume you can lock an essentially unbounded number of buffers. This seems almost guaranteed to result in deadlocks. And there's limits on how many lwlocks one can hold etc. As far as I can tell there's simply no deadlock avoidance scheme in use here *at all*? I must be missing something. > + /* Main loop for writing the undo record. */ > + do > + { I'd prefer this to not be a do{} while(true) loop - as written I need to read to the end to see what the condition is. I don't think we have any loops like that in the code. > + /* > + * During recovery, there might be some blocks which > are already > + * deleted due to some discard command so we can just > skip > + * inserting into those blocks. > + */ > + if (!BufferIsValid(buffer)) > + { > + Assert(InRecovery); > + > + /* > + * Skip actual writing just update the context > so that we have > + * write offset for inserting into next blocks. > + */ > + SkipInsertingUndoData(&ucontext, BLCKSZ - > starting_byte); > + if (ucontext.stage == UNDO_PACK_STAGE_DONE) > + break; > + } How exactly can this happen? > + else > + { > + page = BufferGetPage(buffer); > + > + /* > + * Initialize the page whenever we try to write > the first > + * record in page. We start writing > immediately after the > + * block header. > + */ > + if (starting_byte == UndoLogBlockHeaderSize) > + UndoPageInit(page, BLCKSZ, > prepared_undo->urec->uur_info, > + > ucontext.already_processed, > + > prepared_undo->urec->uur_tuple.len, > + > prepared_undo->urec->uur_payload.len); > + > + /* > + * Try to insert the record into the current > page. If it > + * doesn't succeed then recall the routine with > the next page. > + */ > + InsertUndoData(&ucontext, page, starting_byte); > + if (ucontext.stage == UNDO_PACK_STAGE_DONE) > + { > + MarkBufferDirty(buffer); > + break; At this point we're five indentation levels deep. I'd extract at least either the the per prepared undo code or the code performing the writing across block boundaries into a separate function. Perhaps both. > +/* > + * Helper function for UndoGetOneRecord > + * > + * If any of rmid/reloid/xid/cid is not available in the undo record, then > + * it will get the information from the first complete undo record in the > + * page. > + */ > +static void > +GetCommonUndoRecInfo(UndoPackContext *ucontext, UndoRecPtr urp, > + RelFileNode rnode, UndoLogCategory > category, Buffer buffer) > +{ > + /* > + * If any of the common header field is not available in the current > undo > + * record then we must read it from the first complete record of the > page. > + */ How is it guaranteed that the first record on the page is actually from the current transaction? Can't there be a situation where that's from another transaction? > +/* > + * Helper function for UndoFetchRecord and UndoBulkFetchRecord > + * > + * curbuf - If an input buffer is valid then this function will not release > the > + * pin on that buffer. 
If the buffer is not valid then it will assign curbuf > + * with the first buffer of the current undo record and also it will keep the > + * pin and lock on that buffer in a hope that while traversing the undo chain > + * the caller might want to read the previous undo record from the same > block. > + */ Wait,
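On the deadlock-avoidance question raised in the review above: the convention used elsewhere in PostgreSQL when several buffers must be content-locked at once is to take the locks in a fixed global order, typically increasing block number. A minimal sketch of that idea follows; the function and array names are hypothetical and this is not code from the patch, just the standard technique being referred to.

#include "postgres.h"
#include "storage/bufmgr.h"

/*
 * Sketch only: lock a set of already-pinned buffers of one relation fork
 * in increasing block-number order, so that two backends locking
 * overlapping sets cannot deadlock against each other.
 */
static int
buffer_blockno_cmp(const void *a, const void *b)
{
    BlockNumber ba = BufferGetBlockNumber(*(const Buffer *) a);
    BlockNumber bb = BufferGetBlockNumber(*(const Buffer *) b);

    if (ba < bb)
        return -1;
    if (ba > bb)
        return 1;
    return 0;
}

static void
lock_buffers_in_order(Buffer *buffers, int nbuffers)
{
    qsort(buffers, nbuffers, sizeof(Buffer), buffer_blockno_cmp);

    for (int i = 0; i < nbuffers; i++)
        LockBuffer(buffers[i], BUFFER_LOCK_EXCLUSIVE);
}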
Re: [PATCH] Stop ALTER SYSTEM from making bad assumptions
Greetings, On Mon, Aug 5, 2019 at 14:29 Tom Lane wrote: > Robert Haas writes: > > All we need to do to resolve this issue is have Tom commit his patch. > > I think Stephen is not being unreasonable to suggest that we need some > documentation about what external tools may safely do to pg.auto.conf. I dare say that if we had some documentation around what to expect and how to deal with it, for our own tools as well as external ones, then maybe we wouldn’t be in this situation in the first place. Clearly ALTER SYSTEM and the pg_basebackup modifications had different understandings and expectations. > So somebody's got to write that. But I agree that we could go ahead > with the code patch. I haven’t looked at the code patch at all, just to be clear. That said, if you’re comfortable with it and it’s in line with what we document as being how you handle pg.auto.conf (for ourselves as well as external tools..), then that’s fine with me. Thanks, Stephen >
Re: [PATCH] Stop ALTER SYSTEM from making bad assumptions
On Mon, Aug 5, 2019 at 2:29 PM Tom Lane wrote: > Robert Haas writes: > > All we need to do to resolve this issue is have Tom commit his patch. > > I think Stephen is not being unreasonable to suggest that we need some > documentation about what external tools may safely do to pg.auto.conf. > So somebody's got to write that. I mean, really? We're going to document that if you want to add a setting to the file, you can just append it, but that if you find yourself desirous of appending so many settings that the entire disk will fill up, you should maybe reconsider? Perhaps I'm being mean here, but that seems like it's straight out of the blinding-flashes-of-the-obvious department. If we were going to adopt Stephen's proposed rule that you must remove duplicates or be punished later with an annoying WARNING, I would agree that it ought to be documented, because I believe many people would find that a POLA violation. And to be clear, I'm not objecting to a sentence someplace that says that duplicates in postgresql.auto.conf will be ignored and the last value will be used, same as for any other PostgreSQL configuration file. However, I think it's highly likely people would have assumed that anyway. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: [PATCH] Stop ALTER SYSTEM from making bad assumptions
Robert Haas writes: > On Mon, Aug 5, 2019 at 2:29 PM Tom Lane wrote: >> I think Stephen is not being unreasonable to suggest that we need some >> documentation about what external tools may safely do to pg.auto.conf. >> So somebody's got to write that. > I mean, really? We're going to document that if you want to add a > setting to the file, you can just append it, but that if you find > yourself desirous of appending so many settings that the entire disk > will fill up, you should maybe reconsider? Perhaps I'm being mean > here, but that seems like it's straight out of the > blinding-flashes-of-the-obvious department. I don't think we need to go on about it at great length, but it seems to me that it'd be reasonable to point out that (a) you'd be well advised not to touch the file while the postmaster is up, and (b) last setting wins. Those things are equally true of postgresql.conf of course, but I don't recall whether they're already documented. regards, tom lane
Re: [PATCH] Stop ALTER SYSTEM from making bad assumptions
On Mon, Aug 5, 2019 at 2:42 PM Tom Lane wrote: > I don't think we need to go on about it at great length, but it seems > to me that it'd be reasonable to point out that (a) you'd be well > advised not to touch the file while the postmaster is up, and (b) > last setting wins. Those things are equally true of postgresql.conf > of course, but I don't recall whether they're already documented. OK, fair enough. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: [PATCH] Stop ALTER SYSTEM from making bad assumptions
On Mon, Aug 5, 2019 at 2:35 PM Stephen Frost wrote: > I dare say that if we had some documentation around what to expect and how to > deal with it, for our own tools as well as external ones, then maybe we > wouldn’t be in this situation in the first place. Clearly ALTER SYSTEM and > the pg_basebackup modifications had different understands and expectations. But that's not the problem. The problem is that ALTER SYSTEM modifies the first match instead of the last one, when it's a well-established principle that when reading from a PostgreSQL configuration file, the system adopts the value from the last match, not the first one. I admit that if somebody had thought to document what ALTER SYSTEM was doing, that person would probably have also realized that they had made a mistake in the code, and then they would have fixed the bug, and that would be great. But we have exactly zero need to invent a new set of principles explaining how to deal with postgresql.auto.conf. We just need to make the ALTER SYSTEM code conform to the same general rule that has been well-established for many years. The only reason why we're still carrying on about this 95 messages later is because you're trying to make an argument that postgresql.auto.conf is a different kind of thing from postgresql.conf and therefore can have its own novel set of rules which consequently need to be debated. IMHO, it's not, it shouldn't, and they don't. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: The unused_oids script should have a reminder to use the 8000-8999 OID range
On Fri, Aug 2, 2019 at 1:28 PM Julien Rouhaud wrote: > I'm fine with it! Pushed a version with similar wording just now. Thanks! -- Peter Geoghegan
Re: [PATCH] Stop ALTER SYSTEM from making bad assumptions
Greetings, On Mon, Aug 5, 2019 at 14:38 Robert Haas wrote: > On Mon, Aug 5, 2019 at 2:29 PM Tom Lane wrote: > > Robert Haas writes: > > > All we need to do to resolve this issue is have Tom commit his patch. > > > > I think Stephen is not being unreasonable to suggest that we need some > > documentation about what external tools may safely do to pg.auto.conf. > > So somebody's got to write that. > > I mean, really? We're going to document that if you want to add a > setting to the file, you can just append it, but that if you find > yourself desirous of appending so many settings that the entire disk > will fill up, you should maybe reconsider? Perhaps I'm being mean > here, but that seems like it's straight out of the > blinding-flashes-of-the-obvious department. > > If we were going to adopt Stephen's proposed rule that you must remove > duplicates or be punished later with an annoying WARNING, I would > agree that it ought to be documented, because I believe many people > would find that a POLA violation. And to be clear, I'm not objecting > to a sentence someplace that says that duplicates in > postgresql.auto.conf will be ignored and the last value will be used, > same as for any other PostgreSQL configuration file. However, I think > it's highly likely people would have assumed that anyway. The authors and committer for ALTER SYSTEM didn’t. It’s not uncommon for us to realize when different people and/or parts of the system make different assumption about something and they end up causing bugs, we try to document the “right way” and what expectations one can have. Also, to be clear, if we document it then I don’t feel we need a WARNING to be issued- because then it’s expected and handled. Yes, there was a lot of discussion about how it’d be nice to go further than documentation and actually provide a facility for tools to use to modify the file, so we could have the same code being used by pg_basebackup and ALTER SYSTEM, but the argument was made that we can’t make that happen for v12. I’m not sure I agree with that because a lot of the responses have been blowing up the idea of what amounts to a simple parser/editor of PG config files (which, as Christoph pointed out, has already been done externally and I doubt it’s actually all that’s much code) to a full blown we must validate everything and load every extension’s .so file to make sure the value is ok, but even so, I’ve backed away from that position and agreed that a documentation fix would be enough for v12 and hopefully someone will revisit it in the future and improve on it- at least with the documentation, there would be a higher chance that they’d get it right and not end up making different assumptions than those made by other hackers. Thanks, Stephen
Re: The unused_oids script should have a reminder to use the 8000-8999 OID range
On Mon, Aug 5, 2019 at 8:51 PM Peter Geoghegan wrote: > > Pushed a version with similar wording just now. Thanks!
Re: [PATCH] Stop ALTER SYSTEM from making bad assumptions
Greetings, On Mon, Aug 5, 2019 at 14:43 Tom Lane wrote: > Robert Haas writes: > > On Mon, Aug 5, 2019 at 2:29 PM Tom Lane wrote: > >> I think Stephen is not being unreasonable to suggest that we need some > >> documentation about what external tools may safely do to pg.auto.conf. > >> So somebody's got to write that. > > > I mean, really? We're going to document that if you want to add a > > setting to the file, you can just append it, but that if you find > > yourself desirous of appending so many settings that the entire disk > > will fill up, you should maybe reconsider? Perhaps I'm being mean > > here, but that seems like it's straight out of the > > blinding-flashes-of-the-obvious department. > > I don't think we need to go on about it at great length, but it seems > to me that it'd be reasonable to point out that (a) you'd be well > advised not to touch the file while the postmaster is up, and (b) > last setting wins. Those things are equally true of postgresql.conf > of course, but I don't recall whether they're already documented. Folks certainly modify postgresql.conf while the postmaster is running pretty routinely, and we expect them to which is why we have a reload option, so I don’t think we can say that the auto.conf and postgresql.conf are to be handled in the same way. Last setting wins, duplicates should be ignored and may be removed, comments should be ignored and may be removed, and appending to the file is acceptable for modifying a value. I’m not sure how much we really document the structure of the file itself offhand- back when users were editing it we could probably be a bit more fast and loose with it, but now that we have different parts of the system modifying it along with external tools doing so, we should probably write it down a bit more clearly/precisely. I suspect the authors of pg_conftool would appreciate that too, so they could make sure that they aren’t doing anything unexpected or incorrect. Thanks, Stephen >
Re: [PATCH] Stop ALTER SYSTEM from making bad assumptions
Robert Haas writes: > But that's not the problem. The problem is that ALTER SYSTEM modifies > the first match instead of the last one, when it's a well-established > principle that when reading from a PostgreSQL configuration file, the > system adopts the value from the last match, not the first one. I > admit that if somebody had thought to document what ALTER SYSTEM was > doing, that person would probably have also realized that they had > made a mistake in the code, and then they would have fixed the bug, > and that would be great. Well, actually, the existing code is perfectly clear about this: /* Search the list for an existing match (we assume there's only one) */ That assumption is fine *if* you grant that only ALTER SYSTEM itself is authorized to write that file. I think the real argument here centers around who else is authorized to write the file, and when and how. In view of the point you made upthread that we explicitly made pg.auto.conf a plain text file so that one could recover from mistakes by hand-editing it, I think it's pretty silly to adopt a position that external mods are disallowed. regards, tom lane
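To make the first-match-versus-last-match point concrete: since later entries in a configuration file override earlier ones, the entry that actually takes effect is the last match, and that is the one ALTER SYSTEM should update. The following is a simplified sketch of that rule over a generic name/value list; it is not the actual ConfigVariable machinery from guc-file.l.

#include <string.h>

/* Simplified stand-in for one parsed configuration entry. */
typedef struct ConfigEntry
{
    const char *name;
    const char *value;
    struct ConfigEntry *next;
} ConfigEntry;

/*
 * Return the last entry matching "name", i.e. the one that takes effect,
 * because later entries override earlier ones.  Updating the first match
 * instead, as ALTER SYSTEM previously did, can leave a later duplicate
 * silently overriding the new value.
 */
static ConfigEntry *
find_effective_entry(ConfigEntry *head, const char *name)
{
    ConfigEntry *match = NULL;

    for (ConfigEntry *item = head; item != NULL; item = item->next)
    {
        if (strcmp(item->name, name) == 0)
            match = item;       /* keep scanning; a later entry wins */
    }
    return match;
}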
Re: [PATCH] Stop ALTER SYSTEM from making bad assumptions
Robert Haas writes: > On Mon, Aug 5, 2019 at 2:42 PM Tom Lane wrote: >> I don't think we need to go on about it at great length, but it seems >> to me that it'd be reasonable to point out that (a) you'd be well >> advised not to touch the file while the postmaster is up, and (b) >> last setting wins. Those things are equally true of postgresql.conf >> of course, but I don't recall whether they're already documented. > OK, fair enough. Concretely, how about the attached? (Digging around in config.sgml, I found that last-one-wins is stated, but only in the context of one include file overriding another. That's not *directly* a statement about what happens within a single file, and it's in a different subsection anyway, so repeating the info in 19.1.2 doesn't seem unreasonable.) regards, tom lane diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml index cdc30fa..f5986b2 100644 --- a/doc/src/sgml/config.sgml +++ b/doc/src/sgml/config.sgml @@ -153,6 +153,8 @@ shared_buffers = 128MB identifiers or numbers must be single-quoted. To embed a single quote in a parameter value, write either two quotes (preferred) or backslash-quote. + If the file contains multiple entries for the same parameter, + all but the last one are ignored. @@ -185,18 +187,27 @@ shared_buffers = 128MB In addition to postgresql.conf, a PostgreSQL data directory contains a file postgresql.auto.confpostgresql.auto.conf, - which has the same format as postgresql.conf but should - never be edited manually. This file holds settings provided through - the command. This file is automatically - read whenever postgresql.conf is, and its settings take - effect in the same way. Settings in postgresql.auto.conf - override those in postgresql.conf. + which has the same format as postgresql.conf but + is intended to be edited automatically not manually. This file holds + settings provided through the command. + This file is read whenever postgresql.conf is, + and its settings take effect in the same way. Settings + in postgresql.auto.conf override those + in postgresql.conf. + + + + External tools might also + modify postgresql.auto.conf, typically by appending + new settings to the end. It is not recommended to do this while the + server is running, since a concurrent ALTER SYSTEM + command could overwrite such changes. The system view pg_file_settings - can be helpful for pre-testing changes to the configuration file, or for + can be helpful for pre-testing changes to the configuration files, or for diagnosing problems if a SIGHUP signal did not have the desired effects.
Cleanup of intro.sgml
-hackers, I went through and made some readability and modernization of the intro.sgml today. Patch attached. JD -- Command Prompt, Inc. || http://the.postgres.company/ || @cmdpromptinc Postgres centered full stack support, consulting and development. Advocate: @amplifypostgres || Get help: https://commandprompt.com/ * Unless otherwise stated, opinions are my own. * diff --git a/doc/src/sgml/intro.sgml b/doc/src/sgml/intro.sgml index 3038826311..7b23a8a918 100644 --- a/doc/src/sgml/intro.sgml +++ b/doc/src/sgml/intro.sgml @@ -4,78 +4,77 @@ Preface - This book is the official documentation of + This is the official documentation of PostgreSQL. It has been written by the - PostgreSQL developers and other - volunteers in parallel to the development of the - PostgreSQL software. It describes all - the functionality that the current version of - PostgreSQL officially supports. + PostgreSQL community in parallel with the + development of the PostgreSQL database. + It describes the functionality of current version of + PostgreSQL. - + - To make the large amount of information about - PostgreSQL manageable, this book has been - organized in several parts. Each part is targeted at a different - class of users, or at users in different stages of their + This book has been broken up in to categories to better manage + the large amount of information about + PostgreSQL. Each part is targeted at + a different category of user, or at users in different stages of their PostgreSQL experience: is an informal introduction for new users. + It includes information on topics such as installation, architecture + and how to create your first database. documents the SQL query - language environment, including data types and functions, as well - as user-level performance tuning. Every - PostgreSQL user should read this. + language, including data types and functions. It also includes + user-level performance tuning hints. All + PostgreSQL users should read this. describes the installation and - administration of the server. Everyone who runs a - PostgreSQL server, be it for private - use or for others, should read this part. + administration of the PostgreSQL server. + Everyone who administrates a + PostgreSQL server, should read this part. - describes the programming - interfaces for PostgreSQL client - programs. + describes the C based programming + interfaces for PostgreSQL. Topics such as + Python or Go are not discussed in this section. - contains information for - advanced users about the extensibility capabilities of the - server. Topics include user-defined data types and - functions. + contains information on + the extensibility of PostgreSQL. + server. Topics include user-defined data types, procedural languages, + and triggers. - contains reference information about - SQL commands, client and server programs. This part supports - the other parts with structured information sorted by command or - program. + contains reference information including + SQL commands, client applications and server applications. contains assorted information that might be of - use to PostgreSQL developers. + use to PostgreSQL internals developers. 
@@ -120,6 +119,29 @@ + PostgreSQL also supports a comprehensive array of Enterprise level + features: + + + + Unstructured data via JSON + + Binary and Logical Replication + + + Hot Backups + + + Partioniing + + + Materialized Views + + + Procedures + + + Also, PostgreSQL can be extended by the user in many ways, for example by adding new @@ -146,8 +168,8 @@ - And because of the liberal license, - PostgreSQL can be used, modified, and + The liberal license allows + PostgreSQL to be used, modified, and distributed by anyone free of charge for any purpose, be it private, commercial, or academic.
Re: Adding column "mem_usage" to view pg_prepared_statements
Am 05.08.2019 um 18:30 schrieb Konstantin Knizhnik: On 31.07.2019 1:38, Daniel Migowski wrote: Am 31.07.2019 um 00:29 schrieb Tom Lane: Daniel Migowski writes: Ok, just have read about the commitfest thing. Is the patch OK for that? Or is there generally no love for a mem_usage column here? If it was, I really would add some memory monitoring in our app regarding this. You should certainly put it into the next commitfest. We might or might not accept it, but if it's not listed in the CF we might not remember to even review it. (The CF app is really a to-do list for patches ...) OK, added it there. Thanks for your patience :). The patch is not applied to the most recent source because extra parameter was added to CreateTemplateTupleDesc method. Please rebase it - the fix is trivial. OK, will do. I think that including in pg_prepared_statements information about memory used this statement is very useful. CachedPlanMemoryUsage function may be useful not only for this view, but for example it is also need in my autoprepare patch. I would love to use your work if it's done, and would also love to work together here. I am quite novice in C thought, I might take my time to get things right. I wonder if you consider go further and not only report but control memory used by prepared statements? For example implement some LRU replacement discipline on top of prepared statements cache which can evict rarely used prepared statements to avoid memory overflow. THIS! Having some kind of safety net here would finally make sure that my precious processes will not grow endlessly until all mem is eaten up, even with prep statement count limits. While working on stuff I noticed there are three things stored in a CachedPlanSource. The raw query tree (a relatively small thing), the query list (analyzed-and-rewritten query tree) which takes up the most memory (at least here, maybe different with your usecases), and (often after the 6th call) the CachedPlan, which is more optimized that the query list and often needs less memory (half of the query list here). The query list seems to take the most time to create here, because I hit the GEQO engine here, but it could be recreated easily (up to 500ms for some queries). Creating the CachedPlan afterwards takes 60ms in some usecase. IF we could just invalidate them from time to time, that would be grate. Also, invalidating just the queries or the CachedPlan would not invalidate the whole prepared statement, which would break clients expectations, but just make them a slower, adding much to the stability of the system. I would pay that price, because I just don't use manually named prepared statements anyway and just autogenerate them as performance sugar without thinking about what really needs to be prepared anyway. There is an option in the JDBC driver to use prepared statements automatically after you have used them a few time. We have such patch for PgPro-EE but it limits only number of prepared statement, not taken in account amount of memory used by them. I think that memory based limit will be more accurate (although it adds more overhead). Limiting them by number is already done automatically here and would really not be of much value, but having a mem limit would be great. We could have a combined memory limit for your autoprepared statements as well as the manually prepared ones, so clients can know for sure the server processes won't eat up more that e.g. 800MB for prepared statements. 
And also I would like to have this value spread across all client processes, e.g. specifying max_prepared_statement_total_mem=5G for the server, and maybe max_prepared_statement_mem=1G for client processes. Of course we would have to implement cross client process invalidation here, and I don't know if communicating client processes are even intended. Anyway, a memory limit won't really add that much more overhead. At least not more than having no prepared statements at all because of the fear of server OOMs, or having just a small count of those statements. I was even thinking about a prepared statement reaper that checks pg_prepared_statements every few minutes to clean things up manually, but having this in the server would be of great value to me. If you want, I can be a reviewer of your patch. I'd love to have you as my reviewer. Regards, Daniel Migowski
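To sketch the kind of memory-based limit discussed above: track an estimated size per cached statement and evict least-recently-used entries once a configured budget is exceeded. Everything here is hypothetical (the struct, the limit, and the accounting function name) and is only meant to illustrate the shape of such a policy, not an existing PostgreSQL API.

#include <stddef.h>

/* Hypothetical bookkeeping for one cached prepared statement. */
typedef struct CachedStmt
{
    const char *name;
    size_t      mem_bytes;      /* e.g. from a CachedPlanMemoryUsage-style function */
    unsigned long last_used;    /* logical clock for LRU ordering */
    struct CachedStmt *next;
} CachedStmt;

/*
 * Evict least-recently-used statements until the total fits under the
 * limit.  A real implementation would also need to drop the statement's
 * plans (or only its generic plan, keeping the statement itself valid,
 * as suggested above) and update the shared accounting.
 */
static void
enforce_prepared_stmt_limit(CachedStmt **head, size_t limit_bytes,
                            size_t *total_bytes)
{
    while (*total_bytes > limit_bytes && *head != NULL)
    {
        CachedStmt **lru_link = head;
        CachedStmt *victim;

        for (CachedStmt **link = &(*head)->next; *link != NULL;
             link = &(*link)->next)
        {
            if ((*link)->last_used < (*lru_link)->last_used)
                lru_link = link;
        }

        victim = *lru_link;
        *total_bytes -= victim->mem_bytes;
        *lru_link = victim->next;
        /* free the victim's plan memory here in a real implementation */
    }
}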
Re: tableam vs. TOAST
On Fri, Aug 2, 2019 at 6:42 PM Andres Freund wrote: > Hm, those all include writing, right? And for read-only we don't expect > any additional overhead, correct? The write overhead is probably too > large show a bit of function call overhead - but if so, it'd probably be > on unlogged tables? And with COPY, because that utilizes multi_insert, > which means more toasting in a shorter amount of time? Yes and yes. I guess we could test the unlogged case and with COPY, but in any realistic case you're still looking for a tiny CPU overhead in a sea of I/O costs. Even if an extremely short COPY on an unlogged table regresses slightly, we do not normally reject patches that improve code quality on the grounds that they add function call overhead in a few places. Code like this hard to optimize and maintain; as you remarked yourself, there are multiple opportunities to do this stuff better that are hard to see in the current structure. > .oO(why does everyone attach attachements out of order? Is that > a gmail thing?) Must be. > I wonder if toasting.c should be moved too? I mean, we could, but I don't really see a reason. It'd just be moving it from one fairly-generic place to another, and I'd rather minimize churn. > trailing whitespace after "#ifndef DETOAST_H ". Will fix. > Hm. This leaves toast_insert_or_update() as a name. That makes it sound > like it's generic toast code, rather than heap specific? I'll rename it to heap_toast_insert_or_update(). But I think I'll put that in 0004 with the other renames. > It's definitely nice how a lot of repetitive code has been deduplicated. > Also makes it easier to see how algorithmically expensive > toast_insert_or_update() is :(. Yep. > Shouldn't this condition be the other way round? I had to fight pretty hard to stop myself from tinkering with the algorithm -- this can clearly be done better, but I wanted to make it match the existing structure as far as possible. It also only needs to be tested once, not on every loop iteration, so if we're going to start changing things, we should go further than just swapping the order of the tests. For now I prefer to draw a line in the sand and change nothing. > Couldn't most of these be handled via colflags, instead of having > numerous individual checks? I.e. if you had TOASTCOL_COMPRESSED, > TOASTCOL_IGNORED, TOASTCOL_MAIN, TOASTFLAG_EXTERNAL, etc, all but the > size check ought to boil down to a single mask test? I'm not really seeing how more flags would significantly simplify this logic, but I might be missing something. > I wonder if a better prefix wouldn't be toasting_... I'm open to other votes, but I think it's toast_tuple is more specific than toasting_ and thus better. > > +/* > > + * Information about one column of a tuple being toasted. > > + * > > + * NOTE: toast_action[i] can have these values: > > + * ' ' default handling > > + * 'p' already processed --- don't touch it > > + * 'x' incompressible, but OK to move off > > + * > > + * NOTE: toast_attr[i].tai_size is only made valid for varlena attributes > > with > > + * toast_action[i] different from 'p'. > > + */ > > +typedef struct > > +{ > > +struct varlena *tai_oldexternal; > > +int32 tai_size; > > +uint8 tai_colflags; > > +} ToastAttrInfo; > > I think that comment is outdated? Oops. Will fix. > > +/* > > + * Flags indicating the overall state of a TOAST operation. > > + * > > + * TOAST_NEEDS_DELETE_OLD indicates that one or more old TOAST datums need > > + * to be deleted. 
> > + * > > + * TOAST_NEEDS_FREE indicates that one or more TOAST values need to be > > freed. > > + * > > + * TOAST_HAS_NULLS indicates that nulls were found in the tuple being > > toasted. > > + * > > + * TOAST_NEEDS_CHANGE indicates that a new tuple needs to built; in other > > + * words, the toaster did something. > > + */ > > +#define TOAST_NEEDS_DELETE_OLD 0x0001 > > +#define TOAST_NEEDS_FREE0x0002 > > +#define TOAST_HAS_NULLS 0x0004 > > +#define TOAST_NEEDS_CHANGE 0x0008 > > I'd make these enums. They're more often accessible in a debugger... Ugh, I hate that style. Abusing enums to make flag bits makes my skin crawl. I always wondered what the appeal was -- I guess now I know. Blech. > I'm quite unconvinced that making the chunk size specified by the AM is > a good idea to do in isolation. We have TOAST_MAX_CHUNK_SIZE in > pg_control etc. It seems a bit dangerous to let AMs provide the size, > without being very clear that any change of the value will make data > inaccessible. It'd be different if the max were only used during > toasting. I was actually thinking about proposing that we rip TOAST_MAX_CHUNK_SIZE out of pg_control. No real effort has been made here to make this something that users could configure, and I don't know of a good reason to configure it. It also seems pretty out of place in a world where there are mult
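For reference, the enum-as-flag-bits alternative being suggested (and disliked) above would look roughly like this; illustrative only, the names simply mirror the macros quoted from the patch.

/*
 * Each enumerator is a single bit.  The constants are visible by name in
 * a debugger, which is the advantage being argued for; a variable holding
 * an OR of several of them is still just an integer as far as the type
 * system is concerned, which is part of the counter-argument.
 */
typedef enum ToastFlag
{
    TOAST_NEEDS_DELETE_OLD = 0x0001,
    TOAST_NEEDS_FREE       = 0x0002,
    TOAST_HAS_NULLS        = 0x0004,
    TOAST_NEEDS_CHANGE     = 0x0008
} ToastFlag;

/* Combinations of flags are still carried around as a plain integer. */
typedef int ToastFlagSet;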
Re: [PATCH] Stop ALTER SYSTEM from making bad assumptions
On Mon, Aug 5, 2019 at 3:07 PM Tom Lane wrote: > Concretely, how about the attached? Works for me, for whatever that's worth. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: tableam vs. TOAST
Robert Haas writes: > I don't feel entirely convinced that there's any rush to do all of > this right now, and the more I change the harder it is to make sure > that I haven't broken anything. How strongly do you feel about this > stuff? FWIW, I agree with your comment further up that this patch ought to just refactor, not change any algorithms. The latter can be done in separate patches. regards, tom lane
Re: pgbench - implement strict TPC-B benchmark
On Fri, Aug 2, 2019 at 2:38 AM Fabien COELHO wrote: > Ok, one thread cannot feed an N core server if enough client are executed > per thread and the server has few things to do. Right ... where N is, uh, TWO. > The point I'm clumsily trying to make is that pgbench-specific overheads > are quite small: Any benchmark driver would have pretty much at least the > same costs, because you have the cpu cost of the tool itself, then the > library it uses, eg lib{pq,c}, then syscalls. Even if the first costs are > reduced to zero, you still have to deal with the database through the > system, and this part will be the same. I'm not convinced. Perhaps you're right; after all, it's not like pgbench is doing any real work. On the other hand, I've repeatedly been annoyed by how inefficient pgbench is, so I'm not totally prepared to concede that any benchmark driver would have the same costs, or that it's a reasonably well-optimized client application. When I run the pgbench, I want to know how fast the server is, not how fast pgbench is. > What name would you suggest, if it were to be made available from pgbench > as a builtin, that avoids confusion with "tpcb-like"? I'm not in favor of adding it as a built-in. If we were going to do it, I don't know that we could do better than tcpb-like-2, and I'm not excited about that. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: Cleanup of intro.sgml
On 8/5/19 3:20 PM, Joshua D. Drake wrote: > intro.sgml today. Patch attached. Things I noticed quickly: "broken up in to categories": s/in to/into/; "Unstructured data via JSON": (or XML ?); "Partioniing": s/Partioniing/Partitioning/ Regards, -Chap
Re: Redacting information from logs
On Sat, 2019-08-03 at 19:14 -0400, Tom Lane wrote: > It seems to me that it'd be sufficient to do the annotation by > inserting wrapper functions, like the errparam() you suggest above. > If we just had errparam() choosing whether to return "..." instead of > its argument string, we'd have what we need, without messing with > the format language. I'm having trouble getting the ergonomics to work out here so that it can generate both a redacted and an unredacted message. If errparam() is a normal argument to errmsg(), then errparam() will be evaluated first. Will it return the redacted version, the unredacted version, or a special type that holds both? If I try to use macros to force multiple evaluation (to get one redacted and one unredacted string), then it seems like that would happen for all arguments (not just errparam arguments), which would be bad. Suggestions? Regards, Jeff Davis
Re: Index Skip Scan
> On Mon, Aug 5, 2019 at 12:05 PM Floris Van Nee > wrote: > > The root->distinct_pathkeys is already filtered for redundant keys, so column > 'a' is not in there anymore. Still, it'd be much faster to use the skip scan > here, because a regular scan will scan all entries with a=1, even though > we're really only interested in the first one. In previous versions, this > would be fixed by changing the check in planner.c to use > root->uniq_distinct_pathkeys instead of root->distinct_pathkeys, but things > change a bit now that the patch is rebased on the unique-keys patch. Would it > be valid to change this check to root->query_uniquekeys != NIL to consider > skip scans also for this query? [including a commentary from Jesper] On Mon, Aug 5, 2019 at 6:55 PM Jesper Pedersen wrote: Yes, the check should be for that. However, the query in question doesn't have any query_pathkeys, and hence query_uniquekeys in standard_qp_callback(), so therefore it isn't supported. Your scenario is covered by one of the test cases in case the functionality is supported. But, I think that is outside the scope of the current patch. > However, currently, the patch just reads the closest and then doesn't > consider this page at all anymore, if the first tuple skipped to turns out to > be not visible. Consider the following SQL to illustrate: For the record, the purpose of `_bt_read_closest` is not so much to reduce the amount of data we read from a page, but more to correctly handle those situations we were discussing before with reading forward/backward in cursors, since for that in some cases we need to remember previous values for stepping to the next. I've limited the number of items fetched in this function just because I was misled by having a check for dead tuples in `_bt_skip`. Of course we can modify it to read a whole page and leave the visibility check for the code after `index_getnext_tid` (although if we know that all tuples on this page are visible I guess it's not strictly necessary, but we still get an improvement from the skipping itself).
Re: pgbench - implement strict TPC-B benchmark
Hello Andres, Which is a (somehow disappointing) * 3.3 speedup. The impact on the 3 complex expressions tests is not measurable, though. I don't know why that could be disappointing. We put in much more work for much smaller gains in other places. Probably, but I thought I would have a better deal by eliminating most string stuff from variables. Questions: - how likely is such a patch to pass? (IMHO not likely) I don't see why? I didn't review the patch in any detail, but it didn't look crazy in quick skim? Increasing how much load can be simulated using pgbench, is something I personally find much more interesting than adding capabilities that very few people will ever use. Yep, but my point is that the bottleneck is mostly libpq/system, as I tried to demonstrate with the few experiments I reported. FWIW, the areas I find current pgbench "most lacking" during development work are: 1) Data load speed. The data creation is bottlenecked on fprintf in a single process. snprintf actually, could be replaced. I submitted a patch to add more control on initialization, including a server-side loading feature, i.e. the client does not send data, the server generates its own, see 'G': https://commitfest.postgresql.org/24/2086/ However on my laptop it is slower than client-side loading on a local socket. The client version is doing around 70 MB/s, the client load is 20-30%, postgres load is 85%, but I'm not sure I can hope for much more on my SSD. On my laptop the bottleneck is postgres/disk, not fprintf. The index builds are done serially. The vacuum could be replaced by COPY FREEZE. Well, it could be added? For a lot of meaningful tests one needs 10-1000s of GB of testdata - creating that is pretty painful. Yep. 2) Lack of proper initialization integration for custom scripts. Hmmm… You can always write a psql script for schema and possibly simplistic data initialization? However, generating meaningful pseudo-random data for an arbitrary schema is a pain. I did an external tool for that a few years ago: http://www.coelho.net/datafiller.html but it is still a pain. I.e. have steps that are in the custom script that allow -i, vacuum, etc to be part of the script, rather than separately executable steps. --init-steps doesn't do anything for that. Sure. It just gives some control. 3) pgbench overhead, although that's to a significant degree libpq's fault I'm afraid that is currently the case. 4) Ability to cancel pgbench and get approximate results. That currently works if the server kicks out the clients, but not when interrupting pgbench - which is just plain weird. Obviously that doesn't matter for "proper" benchmark runs, but often during development, it's enough to run pgbench past some events (say the next checkpoint). Do you mean have a report anyway on "Ctrl-C"? I usually do a -P 1 to see the progress, but making Ctrl-C work should be reasonably easy. - what is its impact to overall performance when actual queries are performed (IMHO very small). Obviously not huge - I'd also not expect it to be unobservably small either. Hmmm… Indeed, the 20 \set script runs at 2.6 M/s, that is 0.019 µs per \set, and any discussion over the connection is at least 15 µs (for one client on a local socket). -- Fabien.
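On point 4) above (getting approximate results when interrupting pgbench), the usual shape of such a change is small: have a SIGINT handler set a flag and let the main loop finish the current transaction and print what it has. The sketch below is self-contained and purely illustrative; it is not pgbench's actual code.

#include <signal.h>
#include <stdio.h>

static volatile sig_atomic_t cancel_requested = 0;

static void
handle_sigint(int signo)
{
    /* only set a flag; reporting happens in the main loop */
    cancel_requested = 1;
}

int
main(void)
{
    long completed = 0;

    signal(SIGINT, handle_sigint);

    /* stand-in for the benchmark loop */
    while (!cancel_requested /* && !duration_elapsed */)
    {
        /* run one transaction here ... */
        completed++;
    }

    printf("interrupted: %ld transactions completed (approximate results)\n",
           completed);
    return 0;
}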
Re: Adding column "mem_usage" to view pg_prepared_statements
Am 05.08.2019 um 19:16 schrieb Andres Freund: On 2019-07-28 06:20:40 +, Daniel Migowski wrote: how do you want to generalize it? Are you thinking about a view solely for the display of the memory usage of different objects? I'm not quite sure. I'm just not sure that adding separate infrastructure for various objects is a sutainable approach. We'd likely want to have this for prepared statements, for cursors, for the current statement, for various caches, ... I think an approach would be to add an 'owning_object' field to memory contexts, which has to point to a Node type if set. A table returning reporting function could recursively walk through the memory contexts, starting at TopMemoryContext. Whenever it encounters a context with owning_object set, it prints a string version of nodeTag(owning_object). For some node types it knows about (e.g. PreparedStatement, Portal, perhaps some of the caches), it prints additional metadata specific to the type (so for prepared statement it'd be something like 'prepared statement', '[name of prepared statement]'), and prints information about the associated context and all its children. I understand. So it would be something like the output of MemoryContextStatsInternal, but in table form with some extra columns. I would have loved this extra information already in MemoryContextStatsInternal btw., so it might be a good idea to upgrade it first to find the information and wrap a table function over it afterwards. The general context information probably should be something like: context_name, context_ident, context_total_bytes, context_total_blocks, context_total_freespace, context_total_freechunks, context_total_used, context_total_children context_self_bytes, context_self_blocks, context_self_freespace, context_self_freechunks, context_self_used, context_self_children, It might make sense to have said function return a row for the contexts it encounters that do not have an owner set too (that way we'd e.g. get CacheMemoryContext handled), but then still recurse. A nice way to learn about the internals of the server and to analyze the effects of memory reducing enhancements. Arguably the proposed owning_object field would be a bit redundant with the already existing ident/MemoryContextSetIdentifier field, which e.g. already associates the query string with the contexts used for a prepared statement. But I'm not convinced that's going to be enough context in a lot of cases, because e.g. for prepared statements it could be interesting to have access to both the prepared statement name, and the statement. The identifier seems to be more like a category at the moment, because it does not seem to hold any relevant information about the object in question. So a more specific name would be nice. The reason I like something like this is that we wouldn't add new columns to a number of views, and lack views to associate such information to for some objects. And it'd be disproportional to add all the information to numerous places anyway. I understand your argumentation, but things like Cursors and Portals are rather short living while prepared statements seem to be the place where memory really builds up. One counter-argument is that it'd be more expensive to get information specific to prepared statements (or other object types) that way. I'm not sure I buy that that's a problem - this isn't something that's likely going to be used at a high frequency. 
But if it becomes a problem, we can add a function that starts that process at a distinct memory context (e.g. a function that does this just for a single prepared statement, identified by name) - but I'd not start there. I also see no problem here, and with Konstantin Knizhnik's autoprepare I wouldn't use this very often anyway, more just for monitoring purposes, where I don't care if my query is a bit more complex. While being interesting I still believe monitoring the mem usage of prepared statements is a bit more important than that of other objects because of how they change memory consumption of the server without using any DDL or configuration options and I am not aware of other objects with the same properties, or are there some? And for the other volatile objects like tables and indexes and their contents PostgreSQL already has it's information functions. Plenty other objects have that property. E.g. cursors. And for the catalog/relation/... caches it's even more pernicious - the client might have closed all its "handles", but we still use memory (and it's absolutely crucial for performance). Maybe we can do both? Add a single column to pg_prepared_statements, and add another table for the output of MemoryContextStatsDetail? This has the advantage that the single real memory indicator useful for end users (to the question: How much mem takes my sh*t up?) is in pg_prepared_statements and some more intrinsic information in a detail view. Thinking about the latter I am against such a
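For reference, the recursive walk Andres describes (starting at TopMemoryContext and visiting every child) only needs the child links that memory contexts already have; the proposed owning_object field and the per-context reporting callback below are hypothetical, illustrating how such a table-returning function might be structured.

#include "postgres.h"
#include "nodes/memnodes.h"
#include "utils/memutils.h"

typedef void (*context_report_fn) (MemoryContext context, int depth);

/*
 * Sketch only: visit 'context' and all of its descendants, handing each
 * one to a callback that could emit one row per context (name, ident,
 * totals, and, if the proposal is adopted, the owning object's node tag).
 */
static void
report_contexts_recurse(MemoryContext context, int depth,
                        context_report_fn report)
{
    report(context, depth);

    for (MemoryContext child = context->firstchild;
         child != NULL;
         child = child->nextchild)
        report_contexts_recurse(child, depth + 1, report);
}

/* Typical entry point:  report_contexts_recurse(TopMemoryContext, 0, fn); */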
Re: Adding column "mem_usage" to view pg_prepared_statements
Hi, On 2019-08-05 22:46:47 +0200, Daniel Migowski wrote: > > Arguably the proposed owning_object field would be a bit redundant with > > the already existing ident/MemoryContextSetIdentifier field, which > > e.g. already associates the query string with the contexts used for a > > prepared statement. But I'm not convinced that's going to be enough > > context in a lot of cases, because e.g. for prepared statements it could > > be interesting to have access to both the prepared statement name, and > > the statement. > The identifier seems to be more like a category at the moment, because it > does not seem to hold any relevant information about the object in question. > So a more specific name would be nice. I think you might be thinking of the context's name, not ident? E.g. for prepared statements the context name is: source_context = AllocSetContextCreate(CurrentMemoryContext, "CachedPlanSource", ALLOCSET_START_SMALL_SIZES); which is obviously the same for every statement. But then there's MemoryContextSetIdentifier(source_context, plansource->query_string); which obviously differs. > > The reason I like something like this is that we wouldn't add new > > columns to a number of views, and lack views to associate such > > information to for some objects. And it'd be disproportional to add all > > the information to numerous places anyway. > I understand your argumentation, but things like Cursors and Portals are > rather short living while prepared statements seem to be the place where > memory really builds up. That's not necessarily true, especially given WITH HOLD cursors. Nor does one only run out of memory in the context of long-lived objects. > > > While being interesting I still believe monitoring the mem usage of > > > prepared statements is a bit more important than that of other objects > > > because of how they change memory consumption of the server without > > > using any DDL or configuration options and I am not aware of other > > > objects with the same properties, or are there some? And for the other > > > volatile objects like tables and indexes and their contents PostgreSQL > > > already has it's information functions. > > Plenty other objects have that property. E.g. cursors. And for the > > catalog/relation/... caches it's even more pernicious - the client might > > have closed all its "handles", but we still use memory (and it's > > absolutely crucial for performance). > > Maybe we can do both? Add a single column to pg_prepared_statements, and add > another table for the output of MemoryContextStatsDetail? This has the > advantage that the single real memory indicator useful for end users (to the > question: How much mem takes my sh*t up?) is in pg_prepared_statements and > some more intrinsic information in a detail view. I don't see why we'd want to do both. Just makes pg_prepared_statements a considerably more expensive. And that's used by some applications / clients in an automated manner. > Thinking about the latter I am against such a table, at least in the form > where it gives information like context_total_freechunks, because it would > just be useful for us developers. Developers are also an audience for us. I mean we certainly can use this information during development. But even for bugreports such information would be useufl. > Why should any end user care for how many > chunks are still open in a MemoryContext, except when he is working on > C-style extensions. Could just be a source of confusion for them. Meh. 
As long as the crucial stuff is first, that's imo enough. > Let's think about the goal this should have: The end user should be able to > monitor the memory consumption of things he's in control of or could affect > the system performance. Should such a table automatically aggregate some > information? I think so. I would not add more than two memory columns to the > view, just mem_used and mem_reserved. And even mem_used is questionable, > because in his eyes only the memory he cannot use for other stuff because of > object x is important for him (that was the reason I just added one column). > He would even ask: WHY is there 50% more memory reserved than used, and how > I can optimize it? (Would lead to more curious PostgreSQL developers maybe, > so that's maybe a plus). It's important because it influences how memory usage will grow. > On the other hand: The Generic Plan had been created for the first > invocation of the prepared statement, why not store it immediatly. It is a > named statement for a reason that it is intended to be reused, even when it > is just twice, and since memory seems not to be seen as a scarce resource in > this context why not store that immediately. Would drop the need for a > hierarchy here also. Well, we'll maybe never use
Re: Redacting information from logs
Hi, On 2019-08-05 13:37:50 -0700, Jeff Davis wrote: > On Sat, 2019-08-03 at 19:14 -0400, Tom Lane wrote: > > It seems to me that it'd be sufficient to do the annotation by > > inserting wrapper functions, like the errparam() you suggest above. > > If we just had errparam() choosing whether to return "..." instead of > > its argument string, we'd have what we need, without messing with > > the format language. > > I'm having trouble getting the ergonomics to work out here so that it > can generate both a redacted and an unredacted message. > > If errparam() is a normal argument to errmsg(), then errparam() will be > evaluated first. Will it return the redacted version, the unredacted > version, or a special type that holds both? I was thinking that it'd just store a struct ErrParam, which'd reference the passed value and metadata like the name (for structured log output) and redaction category. The bigger problem I see is handling the different types of arguments - but perhaps the answer there would be to just make the macro typesafe? Or have one for scalar values and one for pointer types? We could even allocate the necessary information for this on the stack, with some macro trickery. Not sure if it's worth it, but ... Doing something like this does however require not directly using plain sprintf. I'm personally not bothered by that, but I'd not be surprised if e.g. Tom saw that differently. > If I try to use macros to force multiple evaluation (to get one > redacted and one unredacted string), then it seems like that would > happen for all arguments (not just errparam arguments), which would be > bad. Yea, multiple evaluation clearly is a no-go. Greetings, Andres Freund
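To make the struct idea concrete, here is one hypothetical shape for it; nothing below exists in elog.h today, and the category names are simply borrowed from the examples used elsewhere in this thread.

/* Hypothetical redaction categories. */
typedef enum ErrParamClass
{
    MSGDEFAULT,                 /* safe to log verbatim */
    MSGUSERDATA                 /* user data; may be replaced by "..." */
} ErrParamClass;

/*
 * Hypothetical per-parameter record that an errparam() wrapper could
 * register with the current ErrorData, so that emit_log_hook or a
 * structured-log target can see each parameter's name, class, and value
 * separately from the formatted message text.
 */
typedef struct ErrParam
{
    const char     *name;       /* e.g. "constraintname" */
    ErrParamClass   pclass;     /* drives the redaction decision */
    const char     *value;      /* string form of the value */
} ErrParam;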
Re: Redacting information from logs
On Sun, 2019-08-04 at 11:17 -0700, Andres Freund wrote: > I'm imagining something like > > #define pg_printf(fmt, ...) \ > do { \ > if ( __builtin_constant_p(fmt)) \ > { \ > static processed_fmt processed_fmt_ = {.format = fmt}; \ > if (unlikely(!processed_fmt_.parsed)) \ >preprocess_format_string(&processed_fmt_) \ > pg_printf_processed(&processed_fmt_, __VA_ARGS__); \ > } \ > else \ > pg_printf_unprocessed(fmt, __VA_ARGS__); \ > } while (0) \ What would you do in the preprocessing exactly? Create a list of indexes into the string where the format codes are? > I think we've quite repeatedly had requests for a log format that can > be > parsed reasonably, and annotated with additional information that's > too > verbose for the main message. It's a pain to have to parse quite > complex strings in production, just to extract the value of the > actual > information in the message. > > ts="2019-08-04 10:19:21.286 PDT" pid=32416 vxid=12/9 sev=LOG \ > msg="Prozess 15027 erlangte AccessShareLock-Sperre auf Relation > 1259 der Datenbank 13416 nach 29808,865 ms" \ > fmt="process %d acquired %s on %s after %ld.%03d ms" \ > p:modename="AccessShareLock-Sperre auf Relation 1259" \ > p:msec=18343 p:usec=173 \ > p:locktag=190345/134343/0/0/relation/default \ > c:statement="LOCK TABLE pg_class;" \ > l:func=ProcSleep l:file=proc.c l:line=1495 > ... > If we have enough information to pass to the logging hook, we don't > even > need to define how all of this is going to look like exactly > (although > I'd probably argue that a logfmt or json target ought to be in core). I think I see where you are going with this now: it is almost orthogonal to your new-style format strings ( %{foo}s ), but not quite. You're suggesting that we save all of the arguments, along with some annotation, in the ErrorData struct, and then let emit_log_hook sort it out (perhaps by constructing some JSON message, perhaps translating the message_id, etc.). I like the idea, but still playing with the ergonomics a bit, and how it interacts with various message parts (msg, hint, detail, etc.). If we had the name-based format strings, then the message parts could all share a common set of parameters; but with the current positional format strings, I think each message part would need its own set of arguments. Positional: ereport(ERROR, (errcode(ERRCODE_UNIQUE_VIOLATION), errmsg_params("duplicate key value violates unique constraint \"%s\"", errparam("constraintname", MSGDEFAULT, RelationGetRelationName(rel)), errdetail_params("Key %s already exists.", errparam("key", MSGUSERDATA, key_desc))) ); Named: ereport(ERROR, (errcode(ERRCODE_UNIQUE_VIOLATION), errmsg_rich("duplicate key value violates unique constraint \"%{constraintname}s\""), errdetail_rich("Key %{key}s already exists."), errparam("key", MSGUSERDATA, key_desc)) errparam("constraintname", MSGDEFAULT, RelationGetRelationName(rel))) ); I think either one needs some ergonomic improvements, but it seems like we are going in the right direction. Maybe we can make the parameters common to different message parts by using an integer index to reference the parameter, like: ereport(ERROR, (errcode(ERRCODE_UNIQUE_VIOLATION), errmsg_rich("duplicate key value violates unique constraint \"%s\"", 1 /* second errparam */), errdetail_rich("Key %s already exists.", 0 /* first errparam */), errparam("key", MSGUSERDATA, key_desc)) errparam("constraintname", MSGDEFAULT, RelationGetRelationName(rel))) ); Not quite ideal, but might get us closer. Regards, Jeff Davis
Re: Redacting information from logs
On Mon, 2019-08-05 at 14:10 -0700, Andres Freund wrote: > I was thinking that it'd just store a struct ErrParam, which'd > reference > the passed value and metadata like the name (for structured log > output) and redaction category. The bigger problem I see is handling > the different types of arguments - but perhaps the answer there would > be > to just make the macro typesafe? Or have one for scalar values and > one > for pointer types? Losing the compile-time checks for compatibility between format codes and arguments would be a shame. Are you saying there's a potential solution for that? > Doing something like this does however require not directly using > plain > sprintf. I'm personally not bothered by that, but I'd not be > surprised > if e.g. Tom saw that differently. It may be possible to still use sprintf if we translate the ErrParams into plain values first. Regards, Jeff Davis
Re: Redacting information from logs
Hi, On 2019-08-05 14:26:44 -0700, Jeff Davis wrote: > On Sun, 2019-08-04 at 11:17 -0700, Andres Freund wrote: > > I'm imagining something like > > > > #define pg_printf(fmt, ...) \ > > do { \ > > if ( __builtin_constant_p(fmt)) \ > > { \ > > static processed_fmt processed_fmt_ = {.format = fmt}; \ > > if (unlikely(!processed_fmt_.parsed)) \ > >preprocess_format_string(&processed_fmt_) \ > > pg_printf_processed(&processed_fmt_, __VA_ARGS__); \ > > } \ > > else \ > > pg_printf_unprocessed(fmt, __VA_ARGS__); \ > > } while (0) \ > > What would you do in the preprocessing exactly? Create a list of > indexes into the string where the format codes are? Yea, basically. If you look at snprint.c's dopr(), there's a fair bit of parsing going on that is going to stay the same from call to call in nearly all cases (and the cases where not we possibly ought to fix). And things like the $ processing for argument order, or having named arguments as I suggest, make that more pronounced. > > If we have enough information to pass to the logging hook, we don't > > even > > need to define how all of this is going to look like exactly > > (although > > I'd probably argue that a logfmt or json target ought to be in core). > > I think I see where you are going with this now: it is almost > orthogonal to your new-style format strings ( %{foo}s ), but not quite. > > You're suggesting that we save all of the arguments, along with some > annotation, in the ErrorData struct, and then let emit_log_hook sort it > out (perhaps by constructing some JSON message, perhaps translating the > message_id, etc.). Right. > I like the idea, but still playing with the ergonomics a bit, and how > it interacts with various message parts (msg, hint, detail, etc.). If > we had the name-based format strings, then the message parts could all > share a common set of parameters; but with the current positional > format strings, I think each message part would need its own set of > arguments. Right, I think that's a good part of where I was coming from. > Maybe we can make the parameters common to different message parts by > using an integer index to reference the parameter, like: > >ereport(ERROR, >(errcode(ERRCODE_UNIQUE_VIOLATION), > errmsg_rich("duplicate key value violates unique constraint > \"%s\"", 1 /* second errparam */), > errdetail_rich("Key %s already exists.", 0 /* first > errparam */), > errparam("key", MSGUSERDATA, key_desc)) > errparam("constraintname", MSGDEFAULT, > RelationGetRelationName(rel))) >); > > Not quite ideal, but might get us closer. If we insist that errmsg_rich/errdetail_rich may not have parameters, then they can just use the same set of arguments, without any of this, at the cost of sometimes more complicated % syntax (i.e. %1$d to refer to the first argument). I think the probable loss of gcc format warnings would be the biggest issue with this whole proposal, and potential translator trouble the biggest impediment for named parameters. Greetings, Andres Freund
Re: Redacting information from logs
Hi, On 2019-08-05 14:32:36 -0700, Jeff Davis wrote: > On Mon, 2019-08-05 at 14:10 -0700, Andres Freund wrote: > > I was thinking that it'd just store a struct ErrParam, which'd > > reference > > the passed value and metadata like the name (for structured log > > output) and redaction category. The bigger problem I see is handling > > the different types of arguments - but perhaps the answer there would > > be > > to just make the macro typesafe? Or have one for scalar values and > > one > > for pointer types? > > Losing the compile-time checks for compatibility between format codes > and arguments would be a shame. Are you saying there's a potential > solution for that? Yea, I'd just written that in another reply to yours. I did actually think about this some recently: https://www.postgresql.org/message-id/20190730181845.jyyk4selyohagwnf%40alap3.anarazel.de Not sure if any of that is really applicable. Once more I really am regretting that PG is C rather than C++. > > Doing something like this does however require not directly using > > plain > > sprintf. I'm personally not bothered by that, but I'd not be > > surprised > > if e.g. Tom saw that differently. > > It may be possible to still use sprintf if we translate the ErrParams > into plain values first. It's pretty hard to call those functions with a runtime variable number of arguments. va_list as an interface is not great... Greetings, Andres Freund
Re: Assertion for logically decoding multi inserts into the catalog
> On 31 Jul 2019, at 19:20, Heikki Linnakangas wrote: > This patch makes the assertion more strict than it was before. I don't see > how it could possibly make a regression failure go away. On the contrary. So, > huh? Yeah, this is clearly fat-fingered, the intent is to only run the Assert in case XLH_INSERT_CONTAINS_NEW_TUPLE is set in xlrec->flags, as it only applies under that condition. The attached is tested in both in the multi-insert patch and on HEAD, but I wish I could figure out a better way to express this Assert. cheers ./daniel logdec_assert-v2.patch Description: Binary data
Re: Redacting information from logs
On Mon, 2019-08-05 at 14:44 -0700, Andres Freund wrote: > at the cost of sometimes more complicated % syntax (i.e. %1$d to > refer > to the first argument). > > I think the probable loss of gcc format warnings would be the biggest > issue with this whole proposal, and potential translator trouble the > biggest impediment for named parameters. I'd be OK with '%1$d' syntax. That leaves type safety as the main problem. Your solution involving _Generic is interesting -- I didn't even know that existed. I don't think it needs to be supported on all compilers, as long as we are getting errors from somewhere. They would be runtime errors instead of compile-time errors, though. Regards, Jeff Davis
Putting kerberos/ldap logs somewhere useful
I got frustrated just now because this: https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=crake&dt=2019-08-05%2021%3A18%3A23 is essentially undebuggable, thanks to the buildfarm's failure to capture any error output from slapd. That's not the buildfarm script's fault: it's willing to capture everything placed in the agreed-on log directory. But the TAP test script randomly places the daemon's log file somewhere else, one level up. The kerberos test script has the same problem. Hence, I propose the attached. This just moves the actual log files ... we could possibly move the daemons' .conf files as well, but I think they're probably not variable enough to be interesting. regards, tom lane diff --git a/src/test/kerberos/t/001_auth.pl b/src/test/kerberos/t/001_auth.pl index 34845a7..e3eb052 100644 --- a/src/test/kerberos/t/001_auth.pl +++ b/src/test/kerberos/t/001_auth.pl @@ -67,8 +67,8 @@ my $realm= 'EXAMPLE.COM'; my $krb5_conf = "${TestLib::tmp_check}/krb5.conf"; my $kdc_conf= "${TestLib::tmp_check}/kdc.conf"; -my $krb5_log= "${TestLib::tmp_check}/krb5libs.log"; -my $kdc_log = "${TestLib::tmp_check}/krb5kdc.log"; +my $krb5_log= "${TestLib::log_path}/krb5libs.log"; +my $kdc_log = "${TestLib::log_path}/krb5kdc.log"; my $kdc_port= get_free_port(); my $kdc_datadir = "${TestLib::tmp_check}/krb5kdc"; my $kdc_pidfile = "${TestLib::tmp_check}/krb5kdc.pid"; diff --git a/src/test/ldap/t/001_auth.pl b/src/test/ldap/t/001_auth.pl index 47bc090..5efb87a 100644 --- a/src/test/ldap/t/001_auth.pl +++ b/src/test/ldap/t/001_auth.pl @@ -52,7 +52,7 @@ my $ldap_datadir = "${TestLib::tmp_check}/openldap-data"; my $slapd_certs = "${TestLib::tmp_check}/slapd-certs"; my $slapd_conf= "${TestLib::tmp_check}/slapd.conf"; my $slapd_pidfile = "${TestLib::tmp_check}/slapd.pid"; -my $slapd_logfile = "${TestLib::tmp_check}/slapd.log"; +my $slapd_logfile = "${TestLib::log_path}/slapd.log"; my $ldap_conf = "${TestLib::tmp_check}/ldap.conf"; my $ldap_server = 'localhost'; my $ldap_port = get_free_port();
Re: An out-of-date comment in nodeIndexonlyscan.c
On Fri, Jun 28, 2019 at 12:55 PM Ashwin Agrawal wrote: > On Thu, Jun 27, 2019 at 4:33 PM Thomas Munro wrote: >> I wonder if it might be possible to avoid page locks on both the heap >> *and* index in some limited cases, as I mentioned over here (just an >> idea, could be way off base): >> >> https://www.postgresql.org/message-id/CA%2BhUKGJGDVfhHmoUGzi-_J%2BN8FmRjmWTY0MCd1ZV5Fj9qdyb1w%40mail.gmail.com > > I am in same boat as you. One can get serializable conflict error today based > on tuple gets HOT updated or not. HOT is logically internal code optimization > and not so much user functionality, so ideally feels shouldn't affect > serializable behavior. But it does currently, again, due to index locking. > Disable HOT update and 4 isolation tests fail due to "could not serialize > access due to read/write dependencies among transactions" otherwise not. If > the tuple happens to fit on same page user will not get the error, if the > tuple gets inserted to different page the error can happen, due to index page > locking. I had discussed this with Heikki and thinking is we shouldn't need > to take the lock on the index page, if the index key was not changed, even if > it was a non-HOT update. Not sure of complications and implications, but just > a thought. Oh, I think the idea I was suggesting might be the same as this item from README-SSI (unrelated to HOT, but related to index uniqueness, particularly in index-only-scan where you might be able to skip the btree page lock): "Several optimizations are possible, though not all are implemented yet: ... * An index scan which is comparing for equality on the entire key for a unique index need not acquire a predicate lock as long as a key is found corresponding to a visible tuple which has not been modified by another transaction -- there are no "between or around" gaps to cover." -- Thomas Munro https://enterprisedb.com
Re: POC: Cleaning up orphaned files using undo logs
On Mon, Aug 5, 2019 at 12:42 PM Andres Freund wrote: > A good move in the right direction, imo. I spent some more time thinking about this and talking to Thomas about it and I'd like to propose a somewhat more aggressive restructuring proposal, with the aim of getting a cleaner separation between layers of this patch set. Right now, the undo log storage stuff knows nothing about the contents of an undo log, whereas the undo interface storage knows everything about the contents of an undo log. In particular, it knows that it's a series of records, and those records are grouped into transactions, and it knows both the format of the individual records and also the details of how transaction headers work. Nothing can use the undo log storage system except for the undo interface layer, because the undo interface layer assumes that all the data in the undo storage system conforms to the record/recordset format which it defines. However, there are a few warts: while the undo log storage patch doesn't know anything about the contents of undo logs, it does know that that transaction boundaries matter, and it signals to the undo interface layer whether a transaction header should be inserted for a new record. That's a strange thing for the storage layer to be doing. Also, in addition to three persistence levels, it knows about a fourth undo log category for "special" data for multixact or TPD-like things. That's another wart. Suppose that we instead invent a new layer which sits on top of the undo log storage layer. This layer manages what I'm going to call GHOBs, growable hunks of bytes. (This is probably not the best name, but I thought of it in 5 seconds during a complex technical conversation, so bear with me.) The GHOB layer supports open/close/grow/write/overwrite operations. Conceptually, you open a GHOB with an initial size and a persistence level, and then you can subsequently grow it unless you fill up the undo log in which case you can't grow it any more; when you're done, you close it. Opening and closing a GHOB are operations that only make in-memory state changes. Opening a GHOB finds a place where you could write the initial amount of data you specify, but it doesn't actually write any data or change any persistent state yet, except for making sure that nobody else can grab that space as long as you have the GHOB open. Closing a GHOB tells the system that you're not going to grow the object any more, which means some other GHOB can be placed immediately after the last data you wrote. Growing a GHOB doesn't do anything persistent either; it just tests whether there would be room to write those bytes. So, the only operations that make actual persistent changes are write and overwrite. These operations just copy data into shared buffers and mark them dirty, but they are set up so that you can integrate this with whatever WAL-logging your doing for those operations, so that you can make the same writes happen at redo time. Then, on top of the GHOB layer, you have separate submodules for different kinds of GHOBs. Most importantly, you have a transaction-GHOB manager, which opens a GHOB per persistence level the first time somebody wants to write to it and closes those GHOBs at end-of-xact. AMs push records into the transaction-GHOB manager, and it pushes them into GHOBs on the other side. Then you can also have a multi-GHOB manager, which would replace what Thomas now has as a separate undo log category. 
The undo-log-storage layer wouldn't have any fixed limit on the number of GHOBs that could be open at the same time; it would just be the sum of whatever the individual GHOB type managers can open. It would be important to keep that number fairly small since there's not an unlimited supply of undo logs, but that doesn't seem like a problem for any of the uses we currently have in mind. Each GHOB would begin with a magic number identifying the GHOB type, and would have callbacks for everything else, like "how big is this GHOB?" and "is it discardable?". I'm not totally sure I've thought through all of the problems here, but it seems like this might help us fix some of the aforementioned layering inversions. The undo log storage system only knows about storage: it doesn't have to help with things like transaction boundaries any more, and it continues to be indifferent to the actual contents of the storage. At the GHOB layer, we know that we've got chunks of storage which are the unit of undo discard, and we know that they start with a magic number that identifies the type, but it doesn't know whether they are internally broken into records or, if so, how those records are organized. The individual GHOB managers do know that stuff; for example, the transaction-GHOB manager would know that AMs insert undo records and how those records are compressed and so forth. One thing that feels good about this system is that you could actually write something like the test_undo module that Thomas ha
Re: Psql patch to show access methods info
On Wed, Jul 24, 2019 at 4:59 PM Alexander Korotkov wrote: > On Wed, Jul 24, 2019 at 9:01 AM Andres Freund wrote: > > Based on a quick skim of the thread - which means I most definitely > > missed things - there's not been discussion of why we actually want to > > add this. Who's the prospective user of this facility? And why wouldn't > > they just query pg_am[proc]? None of this information seems like it's > > going to be even remotely targeted towards even advanced users. For > > developers it's not clear what these add? > > I see your point regarding pg_am details. Probably nobody expect > developers need this. And probably even developers don't need this, > because it's easier to see IndexAmRoutine directly with more details. > So, +1 for removing this. > > pg_amproc for gin/gist/sp-gist/brin is probably for developers. But I > think pg_amproc for btree/hash could be useful for advanced users. > btree/hash opclasses could be written by advanced users using > pl/something, I've faced that several times. Revised patch is attached. Changes to \dA+ command are reverted. It also contains some minor improvements. Second patch looks problematic for me, because it provides index description alternative to \d+. IMHO, if there is something really useful to display about index, we should keep it in \d+. So, I propose to postpone this. -- Alexander Korotkov Postgres Professional: http://www.postgrespro.com The Russian Postgres Company 0001-Add-psql-AM-info-commands-v09.patch Description: Binary data
Re: Implementing Incremental View Maintenance
It's not mentioned below but some bugs including seg fault when --enable-casser is enabled was also fixed in this patch. BTW, I found a bug with min/max support in this patch and I believe Yugo is working on it. Details: https://github.com/sraoss/pgsql-ivm/issues/20 Best regards, -- Tatsuo Ishii SRA OSS, Inc. Japan English: http://www.sraoss.co.jp/index_en.php Japanese:http://www.sraoss.co.jp From: Yugo Nagata Subject: Re: Implementing Incremental View Maintenance Date: Wed, 31 Jul 2019 18:08:51 +0900 Message-ID: <20190731180851.73856441d8abb494bf5e6...@sraoss.co.jp> > Hi, > > Attached is the latest patch for supporting min and max aggregate functions. > > When new tuples are inserted into base tables, if new values are small > (for min) or large (for max), matview just have to be updated with these > new values. Otherwise, old values just remains. > > However, in the case of deletion, this is more complicated. If deleted > values exists in matview as current min or max, we have to recomputate > new min or max values from base tables for affected groups, and matview > should be updated with these recomputated values. > > Also, regression tests for min/max are also added. > > In addition, incremental update algorithm of avg aggregate values is a bit > improved. If an avg result in materialized views is updated incrementally > y using the old avg value, numerical errors in avg values are accumulated > and the values get wrong eventually. To prevent this, both of sum and count > values are contained in views as hidden columns and use them to calculate > new avg value instead of using old avg values. > > Regards, > > On Fri, 26 Jul 2019 11:35:53 +0900 > Yugo Nagata wrote: > >> Hi, >> >> I've updated the wiki page of Incremental View Maintenance. >> >> https://wiki.postgresql.org/wiki/Incremental_View_Maintenance >> >> On Thu, 11 Jul 2019 13:28:04 +0900 >> Yugo Nagata wrote: >> >> > Hi Thomas, >> > >> > Thank you for your review and discussion on this patch! >> > >> > > > 2019年7月8日(月) 15:32 Thomas Munro : >> > > > >> > > > > On Fri, Jun 28, 2019 at 10:56 PM Yugo Nagata >> > > > > wrote: >> > > > > > Attached is a WIP patch of IVM which supports some aggregate >> > > > > > functions. >> > > > > >> > > > > Hi Nagata-san and Hoshiai-san, >> > > > > >> > > > > Thank you for working on this. I enjoyed your talk at PGCon. I've >> > > > > added Kevin Grittner just in case he missed this thread; he has >> > > > > talked >> > > > > often about implementing the counting algorithm, and he wrote the >> > > > > "trigger transition tables" feature to support exactly this. While >> > > > > integrating trigger transition tables with the new partition >> > > > > features, >> > > > > we had to make a number of decisions about how that should work, and >> > > > > we tried to come up with answers that would work for IMV, and I hope >> > > > > we made the right choices! >> > >> > Transition tables is a great feature. I am now using this in my >> > implementation >> > of IVM, but the first time I used this feature was when I implemented a PoC >> > for extending view updatability of PostgreSQL[1]. At that time, I didn't >> > know >> > that this feature is made originally aiming to support IVM. >> > >> > [1] https://www.pgcon.org/2017/schedule/events/1074.en.html >> > >> > I think transition tables is a good choice to implement a statement level >> > immediate view maintenance where materialized views are refreshed in a >> > statement >> > level after trigger. 
However, when implementing a transaction level >> > immediate >> > view maintenance where views are refreshed per transaction, or deferred >> > view >> > maintenance, we can't update views in a after trigger, and we will need an >> > infrastructure to manage change logs of base tables. Transition tables can >> > be >> > used to collect these logs, but using logical decoding of WAL is another >> > candidate. >> > In any way, if these logs can be collected in a tuplestore, we might able >> > to >> > convert this to "ephemeral named relation (ENR)" and use this to calculate >> > diff >> > sets for views. >> > >> > > > > >> > > > > I am quite interested to learn how IVM interacts with SERIALIZABLE. >> > > > > >> > > > >> > > > Nagata-san has been studying this. Nagata-san, any comment? >> > >> > In SERIALIZABLE or REPEATABLE READ level, table changes occurred in other >> > ransactions are not visible, so views can not be maintained correctly in >> > AFTER >> > triggers. If a view is defined on two tables and each table is modified in >> > different concurrent transactions respectively, the result of view >> > maintenance >> > done in trigger functions can be incorrect due to the race condition. This >> > is the >> > reason why such transactions are aborted immediately in that case in my >> > current >> > implementation. >> > >> > One idea to resolve this is performing view maintenance in two phases. >> > Firstly, >> > views
Re: [Proposal] Table-level Transparent Data Encryption (TDE) and Key Management Service (KMS)
On Wed, Jul 31, 2019 at 03:29:59PM +0900, Masahiko Sawada wrote: > Just to confirm, we have 21 bits left for nonce in CTR? We have LSN (8 > bytes), page-number (4 bytes) and counter (11 bits) in 16 bytes nonce > space. Even though we have 21 bits left we cannot store relfilenode to > the IV. No. The nonce is the LSN and page number, the CTR counter (11 bits), and 21 extra bits. CTR needs a nonce for every 16-byte block, if you want to think of it that way. Even though there isn't space for the relfilenode in the nonce, We could use the relfilenode/tablespace/database id by mixing into a derived key, based on the master key, but as I stated in another email, we don't want do that unles we have to. > BTW I've received a review about the current design by some > cryptologists in our company. They recommended to use CTR rather than > CBC. The main reason is that in block cipher it's important to ensure > the uniqueness even for every input block to the block cipher. CBC is > hard to ensure that because the previous output is the next block's > input. Whereas in CTR, it encrypts each blocks separately with > key+nonce. Therefore if we can ensure the uniqueness of IV we can meet > that. Also it's not necessary to encrypt IV as it's okey to be > predictable. So I vote for CTR for at least for tables/indexes > encryption, there already might be consensus though. Yes, you are more likely to get duplicate nonce in CBC mode rather than the CTR mode we are proposing. -- Bruce Momjian http://momjian.us EnterpriseDB http://enterprisedb.com + As you are, so once was I. As I am, so you will be. + + Ancient Roman grave inscription +
Re: [Proposal] Table-level Transparent Data Encryption (TDE) and Key Management Service (KMS)
On Wed, Jul 31, 2019 at 04:58:49PM +0900, Masahiko Sawada wrote: > On Wed, Jul 31, 2019 at 3:29 PM Masahiko Sawada wrote: > > > > > > For WAL encryption, before flushing WAL we encrypt whole 8k WAL page > > and then write only the encrypted data of the new WAL record using > > pg_pwrite() rather than write whole encrypted page. So each time we > > encrypt 8k WAL page we end up with encrypting different data with the > > same key+nonce but since we don't write to the disk other than space > > where we actually wrote WAL records it's not a problem. Is that right? > > Hmm that's incorrect. We always write an entire 8k WAL page even if we > write a few WAl records into a page. It's bad because we encrypt > different pages with the same key+IV, but we cannot change IV for each > WAL writes as we end up with changing also > already-flushed-WAL-records. So we might need to change the WAL write > so that it write only WAL records we actually wrote. Uh, I don't understand. We use the LSN to write the 8k page, and we use a different nonce scheme for the WAL. The LSN changes each time the page is modified. The 8k page in the WAL is encrypted just like the rest of the WAL. -- Bruce Momjian http://momjian.us EnterpriseDB http://enterprisedb.com + As you are, so once was I. As I am, so you will be. + + Ancient Roman grave inscription +
Re: [Proposal] Table-level Transparent Data Encryption (TDE) and Key Management Service (KMS)
On Wed, Jul 31, 2019 at 09:43:00AM -0400, Sehrope Sarkuni wrote: > On Wed, Jul 31, 2019 at 2:32 AM Masahiko Sawada wrote: > > Just to confirm, we have 21 bits left for nonce in CTR? We have LSN (8 > bytes), page-number (4 bytes) and counter (11 bits) in 16 bytes nonce > space. Even though we have 21 bits left we cannot store relfilenode to > the IV. > > > Fields like relfilenode, database, or tablespace could be added to the derived > key, not the per-page IV. There's no space limitations as they are additional > inputs into the HKDF (key derivation function). Yes, but we want to avoid that for other reasons. > For WAL encryption, before flushing WAL we encrypt whole 8k WAL page > and then write only the encrypted data of the new WAL record using > pg_pwrite() rather than write whole encrypted page. So each time we > encrypt 8k WAL page we end up with encrypting different data with the > same key+nonce but since we don't write to the disk other than space > where we actually wrote WAL records it's not a problem. Is that right? > > Ah, this is what I was referring to in my previous mail. I'm not familiar with > how the writes happen yet (reading up...) but, yes, we would need to ensure > that encrypted data is not written more than once (i.e. no writing of encrypt > (zero) followed by writing of encrypt(non-zero) at the same spot). Right. The 8k page LSN changes each time the page is modified, and the is part of the page nonce. -- Bruce Momjian http://momjian.us EnterpriseDB http://enterprisedb.com + As you are, so once was I. As I am, so you will be. + + Ancient Roman grave inscription +
Re: [Proposal] Table-level Transparent Data Encryption (TDE) and Key Management Service (KMS)
On Wed, Jul 31, 2019 at 04:11:03PM +0900, Masahiko Sawada wrote: > On Wed, Jul 31, 2019 at 5:48 AM Bruce Momjian wrote: > > I am thinking for the heap/index IV, it would be: > > > > uint64 lsn; > > unint32 page number; > > /* only uses 11 bits for a zero-based CTR counter for 32k pages */ > > uint32 counter; > > > > +1 > IIUC since this would require to ensure uniqueness by using key+IV we > need to use different keys for different relations. Is that right? No. My other email states that the LSN is only used for a single relation, so there is no need for the relfilenode in the nonce. A single LSN writing to multiple parts of the relation generates a unique nonce since the page number is also part of the nonce. > > and for WAL it would be: > > > > uint64 segment_number; > > uint32counter; > > /* guarantees this IV doesn't match any relation IV */ > > uint32 2^32-1 /* all 1's */ > > I would propose to include the page number within a WAL segment to IV > so that we can encrypt each WAL page with the counter always starting > from 0. And if we use different encryption keys for tables/indexes and What is the value of that? > And if we use different encryption keys for tables/indexes and > WAL I think we don't need 2^32-1. I see little value to using different encryption keys for tables/indexes and WAL. -- Bruce Momjian http://momjian.us EnterpriseDB http://enterprisedb.com + As you are, so once was I. As I am, so you will be. + + Ancient Roman grave inscription +
Re: [PATCH] Stop ALTER SYSTEM from making bad assumptions
Greetings, * Tom Lane (t...@sss.pgh.pa.us) wrote: > Robert Haas writes: > > On Mon, Aug 5, 2019 at 2:42 PM Tom Lane wrote: > >> I don't think we need to go on about it at great length, but it seems > >> to me that it'd be reasonable to point out that (a) you'd be well > >> advised not to touch the file while the postmaster is up, and (b) > >> last setting wins. Those things are equally true of postgresql.conf > >> of course, but I don't recall whether they're already documented. > > > OK, fair enough. > > Concretely, how about the attached? > (Digging around in config.sgml, I found that last-one-wins is stated, > but only in the context of one include file overriding another. > That's not *directly* a statement about what happens within a single > file, and it's in a different subsection anyway, so repeating the > info in 19.1.2 doesn't seem unreasonable.) Agreed. > diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml > index cdc30fa..f5986b2 100644 > --- a/doc/src/sgml/config.sgml > +++ b/doc/src/sgml/config.sgml > @@ -153,6 +153,8 @@ shared_buffers = 128MB > identifiers or numbers must be single-quoted. To embed a single > quote in a parameter value, write either two quotes (preferred) > or backslash-quote. > + If the file contains multiple entries for the same parameter, > + all but the last one are ignored. > Looking at this patch quickly but also in isolation, so I could be wrong here, but it seems like the above might be a good place to mention "duplicate entries and comments may be removed." > > @@ -185,18 +187,27 @@ shared_buffers = 128MB > In addition to postgresql.conf, > a PostgreSQL data directory contains a file > > postgresql.auto.confpostgresql.auto.conf, > - which has the same format as postgresql.conf but > should > - never be edited manually. This file holds settings provided through > - the command. This file is > automatically > - read whenever postgresql.conf is, and its settings > take > - effect in the same way. Settings in > postgresql.auto.conf > - override those in postgresql.conf. > + which has the same format as postgresql.conf but > + is intended to be edited automatically not manually. This file holds > + settings provided through the command. > + This file is read whenever postgresql.conf is, > + and its settings take effect in the same way. Settings > + in postgresql.auto.conf override those > + in postgresql.conf. > + The above hunk looks fine. > + > + External tools might also > + modify postgresql.auto.conf, typically by appending > + new settings to the end. It is not recommended to do this while the > + server is running, since a concurrent ALTER SYSTEM > + command could overwrite such changes. > Alternatively, or maybe also here, we could say "note that appending to the file as a mechanism for setting a new value by an external tool is acceptable even though it will cause duplicates- PostgreSQL will always use the last value set and other tools should as well. Duplicates and comments may be removed when rewriting the file, and parameters may be lower-cased." (istr that last bit being true too but I haven't checked lately). > > The system view >linkend="view-pg-file-settings">pg_file_settings > - can be helpful for pre-testing changes to the configuration file, or for > + can be helpful for pre-testing changes to the configuration files, or > for > diagnosing problems if a SIGHUP signal did not > have the > desired effects. > This hunk looks fine. 
Reviewing https://www.postgresql.org/docs/current/config-setting.html again, it looks reasonably comprehensive regarding the format of the file- perhaps we should link to there from the "external tools might also modify" para..? "Tool authors should review to understand the structure of postgresql.auto.conf". Thanks! Stephen signature.asc Description: PGP signature
Re: [Proposal] Table-level Transparent Data Encryption (TDE) and Key Management Service (KMS)
On Wed, Jul 31, 2019 at 09:25:01AM -0400, Sehrope Sarkuni wrote: > On Tue, Jul 30, 2019 at 4:48 PM Bruce Momjian wrote: > > I had more time to think about the complexity of adding relfilenode to > the IV. Since relfilenode is only unique within a database/tablespace, > we would need to have pg_upgrade preserve database/tablespace oids > (which I assume are the same as the directory and tablespace symlinks). > Then, to decode a page, you would need to look up those values. This is > in addition to the new complexity of CREATE DATABASE and moving files > between tablespaces. I am also concerned that crash recovery operations > and cluster forensics and repair would need to also deal with this. > > I am not even clear if pg_upgrade preserving relfilenode is possible --- > when we wrap the relfilenode counter, does it start at 1 or at the > first-user-relation-oid? If the former, it could conflict with oids > assigned to new system tables in later major releases. Tying the > preservation of relations to two restrictions seems risky. > > > Agreed. Unless you know for sure the input is going to immutable across copies > or upgrades, including anything in either the IV or key derivation gets risky > and could tie you down for the future. That's partly why I like the idea > separate salt (basically you directly pay for the complexity by tracking > that). Yes, fragility is something to be concerned about. The system is already very complex, and we occasionally have to do forensic work or repairs. > Even if we do not include a separate per-relation salt or things like > relfilenode when generating a derived key, we can still include other types of > immutable attributes. For example the fork type could be included to > eventually > allow multiple forks for the same relation to be encrypted with the same IV = > LSN + Page Number as the derived key per-fork would be distinct. Yes, the fork number could be useful in this case. I was thinking we would just leave the extra bits as zeros and we can then set it to '1' or something else for a different fork. > > Using just the page LSN and page number allows a page to be be > decrypted/encrypted independently of its file name, tablespace, and > database, and I think that is a win for simplicity. Of course, if it is > insecure we will not do it. > > > As LSN + Page Number combo is unique for all relations (not just one relation) > I think we're good for pages. Yes, since a single LSN can only be used for a single relation, and I added an Assert to check that. Good. > I am thinking for the heap/index IV, it would be: > > uint64 lsn; > unint32 page number; > /* only uses 11 bits for a zero-based CTR counter for 32k pages */ > uint32 counter; > > > Looks good. > > > and for WAL it would be: > > uint64 segment_number; > uint32 counter; > /* guarantees this IV doesn't match any relation IV */ > uint32 2^32-1 /* all 1's */ > > > I need to read up more on the structure of the WAL records but here's some > high > level thoughts: > > WAL encryption should not use the same key as page encryption so there's no > need to design the IV to try to avoid matching the page IVs. Even a basic > derivation with a single fixed WDEK = HKDF(MDEK, "WAL") and TDEK = HKDF(MDEK, > "PAGE") would ensure separate keys. That's the the literal string "WAL" or > "PAGE" being added as a salt to generate the respective keys, all that matters > is they're different. I was thinking the WAL would use the same key since the nonce is unique between the two. 
What value is there in using a different key? > Ideally WAL encryption would generating new derived keys as part of the WAL > stream. The WAL stream is not fixed so you have the luxury of being able to > add > a "Use new random salt XZY going forward" records. Forcing generation of a new > salt/key upon promotion of a replica would ensure that at least the WAL is > unique going forward. Could also generate a new upon server startup, after Ah, yes, good point, and using a derived key would make that easier. The tricky part is what to use to create the new derived key, unless we generate a random number and store that somewhere in the data directory, but that might lead to fragility, so I am worried. We have pg_rewind, which allows to make the WAL go backwards. What is the value in doing this? > every N bytes, or a new one for each new WAL file. There's much more > flexibility compared to page encryption. > > As WAL is a single continuous stream, we can start the IV for each derived WAL > key from zero. There's no need to complicate it further as Key + IV will never > be reused. Uh, you want a new random key for each WAL file? I was going to use the WAL segment number as the nonce, which is always increasing, and easily determined. The f
Re: [HACKERS] [WIP] Effective storage of duplicates in B-tree index.
On Wed, Jul 31, 2019 at 9:23 AM Anastasia Lubennikova wrote: > > * Included my own pageinspect hack to visualize the minimum TIDs in > > posting lists. It's broken out into a separate patch file. The code is > > very rough, but it might help someone else, so I thought I'd include > > it. > Cool, I think we should add it to the final patchset, > probably, as separate function by analogy with tuple_data_split. Good idea. Attached is v5, which is based on your v4. The three main differences between this and v4 are: * Removed BT_COMPRESS_THRESHOLD stuff, for the reasons explained in my July 24 e-mail. We can always add something like this back during performance validation of the patch. Right now, having no BT_COMPRESS_THRESHOLD limit definitely improves space utilization for certain important cases, which seems more important than the uncertain/speculative downside. * We now have experimental support for unique indexes. This is broken out into its own patch. * We now handle LP_DEAD items in a special way within _bt_insertonpg_in_posting(). As you pointed out already, we do need to think about LP_DEAD items directly, rather than assuming that they cannot be on the page that _bt_insertonpg_in_posting() must process. More on that later. > If sizeof(t_info) + sizeof(key) < sizeof(t_tid), resulting posting tuple > can be > larger. It may happen if keysize <= 4 byte. > In this situation original tuples must have been aligned to size 16 > bytes each, > and resulting tuple is at most 24 bytes (6+2+4+6+6). So this case is > also safe. I still need to think about the exact details of alignment within _bt_insertonpg_in_posting(). I'm worried about boundary cases there. I could be wrong. > I changed DEBUG message to ERROR in v4 and it passes all regression tests. > I doubt that it covers all corner cases, so I'll try to add more special > tests. It also passes my tests, FWIW. > Hmm, I can't get the problem. > In current implementation each posting tuple is smaller than BTMaxItemSize, > so no split can lead to having tuple of larger size. That sounds correct, then. > No, we don't need them both. I don't mind combining them into one macro. > Actually, we never needed BTreeTupleGetMinTID(), > since its functionality is covered by BTreeTupleGetHeapTID. I've removed BTreeTupleGetMinTID() in v5. I think it's fine to just have a comment next to BTreeTupleGetHeapTID(), and another comment next to BTreeTupleGetMaxTID(). > The main reason why I decided to avoid applying compression to unique > indexes > is the performance of microvacuum. It is not applied to items inside a > posting > tuple. And I expect it to be important for unique indexes, which ideally > contain only a few live values. I found that the performance of my experimental patch with unique index was significantly worse. It looks like this is a bad idea, as you predicted, though we may still want to do deduplication/compression with NULL values in unique indexes. I did learn a few things from implementing unique index support, though. BTW, there is a subtle bug in how my unique index patch does WAL-logging -- see my comments within index_compute_xid_horizon_for_tuples(). The bug shouldn't matter if replication isn't used. I don't think that we're going to use this experimental patch at all, so I didn't bother fixing the bug. > if (ItemIdIsDead(itemId)) > continue; > > In the previous review Rafia asked about "some reason". > Trying to figure out if this situation possible, I changed this line to > Assert(!ItemIdIsDead(itemId)) in our test version. 
And it failed in a > performance > test. Unfortunately, I was not able to reproduce it. I found it easy enough to see LP_DEAD items within _bt_insertonpg_in_posting() when running pgbench with the extra unique index patch. To give you a simple example of how this can happen, consider the comments about BTP_HAS_GARBAGE within _bt_delitems_vacuum(). That probably isn't the only way it can happen, either. ISTM that we need to be prepared for LP_DEAD items during deduplication, rather than trying to prevent deduplication from ever having to see an LP_DEAD item. v5 makes _bt_insertonpg_in_posting() prepared to overwrite an existing item if it's an LP_DEAD item that falls in the same TID range (that's _bt_compare()-wise "equal" to an existing tuple, which may or may not be a posting list tuple already). I haven't made this code do something like call index_compute_xid_horizon_for_tuples(), even though that's needed for correctness (i.e. this new code is currently broken in the same way that I mentioned unique index support is broken). I also added a nearby FIXME comment to _bt_insertonpg_in_posting() -- I don't think think that the code for splitting a posting list in two is currently crash-safe. How do you feel about officially calling this deduplication, not compression? I think that it's a more accurate name for the technique. -- Peter Geoghegan v5-0001-Compression-deduplication-in-nbtree.patch Descr