Re: More tests with USING INDEX replident and dropped indexes
On Thu, Aug 27, 2020 at 11:28:35AM +0900, Michael Paquier wrote: > Attached is a patch for 1) and 2) grouped together, to ease review for > now. I think that we had better fix 1) separately though, so I am > going to start a new thread about that with a separate patch as the > current thread is misleading. A fix for consistency problem with indisreplident and invalid indexes has been committed as of 9511fb37. Attached is a rebased patch, where I noticed two incorrect things: - The comment of REPLICA_IDENTITY_INDEX is originally incorrect. If the index is dropped, the replica index switches back to nothing. - relreplident was getting reset one transaction too late, when the old index is marked as dead. -- Michael From fa7585609a614bdba27bc2e3199379d04fc0eabd Mon Sep 17 00:00:00 2001 From: Michael Paquier Date: Sun, 30 Aug 2020 16:02:32 +0900 Subject: [PATCH v5] Reset pg_class.relreplident when associated replica index is dropped When an index associated to REPLICA IDENTITY USING INDEX gets dropped, relreplident gets set to 'i', which can be confusing as the catalogs have a state inconsistent with pg_index that lacks an associated entry marked with indisreplident. This changes the logic so as such an index dropped leads to 'n' to be set for the parent table, equivalent to REPLICA IDENTITY NOTHING. Author: Michael Paquier Reviewed-by: Masahiko Sawada, Rahila Syed, Euler Taveira Discussion: https://postgr.es/m/20200522035028.go2...@paquier.xyz --- src/include/catalog/pg_class.h| 6 +- src/include/commands/tablecmds.h | 2 + src/backend/catalog/index.c | 37 +++- src/backend/commands/tablecmds.c | 58 --- .../regress/expected/replica_identity.out | 30 ++ src/test/regress/sql/replica_identity.sql | 22 +++ doc/src/sgml/catalogs.sgml| 3 +- 7 files changed, 128 insertions(+), 30 deletions(-) diff --git a/src/include/catalog/pg_class.h b/src/include/catalog/pg_class.h index 78b33b2a7f..4f9ce17651 100644 --- a/src/include/catalog/pg_class.h +++ b/src/include/catalog/pg_class.h @@ -175,11 +175,7 @@ typedef FormData_pg_class *Form_pg_class; #define REPLICA_IDENTITY_NOTHING 'n' /* all columns are logged as replica identity */ #define REPLICA_IDENTITY_FULL 'f' -/* - * an explicitly chosen candidate key's columns are used as replica identity. - * Note this will still be set if the index has been dropped; in that case it - * has the same meaning as 'd'. 
- */ +/* an explicitly-chosen candidate key's columns are used as replica identity */ #define REPLICA_IDENTITY_INDEX 'i' /* diff --git a/src/include/commands/tablecmds.h b/src/include/commands/tablecmds.h index c1581ad178..b163a82c52 100644 --- a/src/include/commands/tablecmds.h +++ b/src/include/commands/tablecmds.h @@ -61,6 +61,8 @@ extern void ExecuteTruncateGuts(List *explicit_rels, List *relids, List *relids_ extern void SetRelationHasSubclass(Oid relationId, bool relhassubclass); +extern void SetRelationReplIdent(Oid relationId, char ri_type); + extern ObjectAddress renameatt(RenameStmt *stmt); extern ObjectAddress RenameConstraint(RenameStmt *stmt); diff --git a/src/backend/catalog/index.c b/src/backend/catalog/index.c index 62e487bb4c..61d05495b4 100644 --- a/src/backend/catalog/index.c +++ b/src/backend/catalog/index.c @@ -2005,6 +2005,7 @@ index_drop(Oid indexId, bool concurrent, bool concurrent_lock_mode) Relation indexRelation; HeapTuple tuple; bool hasexprs; + bool resetreplident; LockRelId heaprelid, indexrelid; LOCKTAG heaplocktag; @@ -2048,6 +2049,15 @@ index_drop(Oid indexId, bool concurrent, bool concurrent_lock_mode) */ CheckTableNotInUse(userIndexRelation, "DROP INDEX"); + /* + * Check if the REPLICA IDENTITY of the parent relation needs to be + * reset. This should be done only if the index to drop here is + * marked as used in a replica identity. We cannot rely on the + * contents of pg_index for the index as a concurrent drop would have + * reset indisreplident already, so save the existing value first. + */ + resetreplident = userIndexRelation->rd_index->indisreplident; + /* * Drop Index Concurrently is more or less the reverse process of Create * Index Concurrently. @@ -2132,7 +2142,29 @@ index_drop(Oid indexId, bool concurrent, bool concurrent_lock_mode) */ LockRelationIdForSession(&heaprelid, ShareUpdateExclusiveLock); LockRelationIdForSession(&indexrelid, ShareUpdateExclusiveLock); + } + /* + * If the index to be dropped was marked as indisreplident (note that it + * would have been reset when clearing indisvalid previously), reset + * pg_class.relreplident for the parent table. Note that this part + * needs to be done in the same transaction as the one marking the + * index as invalid, so as we never finish with the case where the + * parent relation uses REPLICA_IDENTITY_INDEX, without any index marked + * with indisrepl
Re: Improvements in Copy From
On Thu, Aug 27, 2020 at 11:02 AM Peter Smith wrote: > > Hello. > > FYI - that patch has conflicts when applied. > Thanks for letting me know. Attached new patch which is rebased on top of head. Regards, VIgnesh EnterpriseDB: http://www.enterprisedb.com From a343fe1f8fdf4293d2ef6841e243390b99f29e28 Mon Sep 17 00:00:00 2001 From: Vignesh C Date: Sun, 30 Aug 2020 12:31:12 +0530 Subject: [PATCH v2] Improvements in copy from. There are couple of improvements for copy from in this patch which is detailed below: a) copy from stdin copies lesser amount of data to buffer even though space is available in buffer because minread was passed as 1 to CopyGetData, fixed it by passing the actual space available in buffer, this reduces the frequent call to CopyGetData. b) Copy from reads header line and does nothing for the read line, we need not clear EOL & need not convert to server encoding for the header line. --- src/backend/commands/copy.c | 13 + 1 file changed, 9 insertions(+), 4 deletions(-) diff --git a/src/backend/commands/copy.c b/src/backend/commands/copy.c index db7d24a..c688baa 100644 --- a/src/backend/commands/copy.c +++ b/src/backend/commands/copy.c @@ -801,14 +801,18 @@ CopyLoadRawBuf(CopyState cstate) { int nbytes = RAW_BUF_BYTES(cstate); int inbytes; + int minread = 1; /* Copy down the unprocessed data if any. */ if (nbytes > 0) memmove(cstate->raw_buf, cstate->raw_buf + cstate->raw_buf_index, nbytes); + if (cstate->copy_dest == COPY_NEW_FE) + minread = RAW_BUF_SIZE - nbytes; + inbytes = CopyGetData(cstate, cstate->raw_buf + nbytes, - 1, RAW_BUF_SIZE - nbytes); + minread, RAW_BUF_SIZE - nbytes); nbytes += inbytes; cstate->raw_buf[nbytes] = '\0'; cstate->raw_buf_index = 0; @@ -3917,7 +3921,7 @@ CopyReadLine(CopyState cstate) } while (CopyLoadRawBuf(cstate)); } } - else + else if (!(cstate->cur_lineno == 0 && cstate->header_line)) { /* * If we didn't hit EOF, then we must have transferred the EOL marker @@ -3951,8 +3955,9 @@ CopyReadLine(CopyState cstate) } } - /* Done reading the line. Convert it to server encoding. */ - if (cstate->need_transcoding) + /* Done reading the line. Convert it to server encoding if not header. */ + if (cstate->need_transcoding && + !(cstate->cur_lineno == 0 && cstate->header_line)) { char *cvt; -- 1.8.3.1
Boundary value check in lazy_tid_reaped()
Hi all,

In the current lazy vacuum implementation, some index AMs such as btree indexes call lazy_tid_reaped() for each index tuple during ambulkdelete to check if the index tuple points to a (collected) garbage tuple. In that function we simply call bsearch(3), but we should be able to know the result without bsearch(3) if the index tuple points to a heap tuple that is out of the range of the collected garbage tuples. I've checked whether bsearch(3) does a boundary value check, but looking at the bsearch(3) implementation in FreeBSD libc[1], it does not. The same is true for glibc. So my proposal is to add a boundary value check in lazy_tid_reaped() before executing bsearch(3). This will help when index vacuum happens multiple times or when garbage tuples are concentrated in a narrow range.

I've done three performance tests. maintenance_work_mem is set to 64MB, with which we can collect about 11 million tuples at maximum, and the table size is 10GB. The table has one btree index. Here is the average index vacuum (ambulkdelete) execution time over three trials:

* Test case 1: I made all pages dirty with 15 million garbage tuples to invoke index vacuum twice. HEAD: 164.7 s Patched: 138.81 s

* Test case 2: I made all pages dirty with 10 million garbage tuples to invoke index vacuum only once, checking for performance degradation. HEAD: 127.8 s Patched: 128.25 s

* Test case 3: I made a certain range of the table dirty with 1 million garbage tuples. HEAD: 31.07 s Patched: 9.41 s

I thought that we could have a generic function wrapping bsearch(3) that does the boundary value check and then does bsearch(3), so that we can use it in other similar places as well. But the attached patch doesn't do that, as I'd like to hear opinions on the proposal first.

Feedback is very welcome.

Regards,

[1] https://svnweb.freebsd.org/base/head/lib/libc/stdlib/bsearch.c?revision=326025&view=markup

-- Masahiko Sawada  http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

vacuum_boundary_check.patch Description: Binary data
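A minimal sketch of the kind of check being proposed (the struct and helper names follow vacuumlazy.c at the time of writing; the attached patch may differ in detail):

/*
 * Sketch only: reject TIDs outside the collected range before bsearch(3).
 */
static bool
lazy_tid_reaped(ItemPointer itemptr, void *state)
{
    LVDeadTuples *dead_tuples = (LVDeadTuples *) state;
    ItemPointer res;

    /* Nothing collected at all? */
    if (dead_tuples->num_tuples == 0)
        return false;

    /* itemptrs[] is sorted in TID order, so check the boundaries first */
    if (ItemPointerCompare(itemptr, &dead_tuples->itemptrs[0]) < 0 ||
        ItemPointerCompare(itemptr,
                           &dead_tuples->itemptrs[dead_tuples->num_tuples - 1]) > 0)
        return false;

    res = (ItemPointer) bsearch((void *) itemptr,
                                (void *) dead_tuples->itemptrs,
                                dead_tuples->num_tuples,
                                sizeof(ItemPointerData),
                                vac_cmp_itemptr);
    return (res != NULL);
}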
Re: Yet another fast GiST build (typo)
> On 23 Aug 2020, at 14:39, Andrey M. Borodin wrote: > > Thanks for reviewing and benchmarking, Pavel! Pavel sent me a few typos offlist. PFA v12 fixing these typos. Thanks! Best regards, Andrey Borodin. v12-0001-Add-sort-support-for-point-gist_point_sortsuppor.patch Description: Binary data v12-0002-Implement-GiST-build-using-sort-support.patch Description: Binary data
Re: [EXTERNAL] Re: WIP: WAL prefetch (another approach)
Greetings, * Tomas Vondra (tomas.von...@2ndquadrant.com) wrote: > On Thu, Aug 27, 2020 at 04:28:54PM -0400, Stephen Frost wrote: > >* Robert Haas (robertmh...@gmail.com) wrote: > >>On Thu, Aug 27, 2020 at 2:51 PM Stephen Frost wrote: > >>> > Hm? At least earlier versions didn't do prefetching for records with an > >>> > fpw, and only for subsequent records affecting the same or if not in > >>> > s_b anymore. > >>> > >>> We don't actually read the page when we're replaying an FPW though..? > >>> If we don't read it, and we entirely write the page from the FPW, how is > >>> pre-fetching helping..? > >> > >>Suppose there is a checkpoint. Then we replay a record with an FPW, > >>pre-fetching nothing. Then the buffer gets evicted from > >>shared_buffers, and maybe the OS cache too. Then, before the next > >>checkpoint, we again replay a record for the same page. At this point, > >>pre-fetching should be helpful. > > > >Sure- but if we're talking about 25GB of WAL, on a server that's got > >32GB, then why would those pages end up getting evicted from memory > >entirely? Particularly, enough of them to end up with such a huge > >difference in replay time.. > > > >I do agree that if we've got more outstanding WAL between checkpoints > >than the system's got memory then that certainly changes things, but > >that wasn't what I understood the case to be here. > > I don't think it's very clear how much WAL there actually was in each > case - the message only said there was more than 25GB, but who knows how > many checkpoints that covers? In the cases with FPW=on this may easily > be much less than one checkpoint (because with scale 45GB an update to > every page will log 45GB of full-page images). It'd be interesting to > see some stats from pg_waldump etc. Also in the message was this: -- In order to avoid checkpoints during benchmark, max_wal_size(200GB) and checkpoint_timeout(200 mins) are set to a high value. -- Which lead me to suspect, at least, that this was much less than a checkpoint, as you suggest. Also, given that the comment was 'run is cancelled when there is a reasonable amount of WAL (>25GB), seems likely that it's at least *around* there. Ultimately though, there just isn't enough information provided to really be able to understand what's going on. I agree, pg_waldump stats would be useful. > >>Admittedly, I don't quite understand whether that is what is happening > >>in this test case, or why SDD vs. HDD should make any difference. But > >>there doesn't seem to be any reason why it doesn't make sense in > >>theory. > > > >I agree that this could be a reason, but it doesn't seem to quite fit in > >this particular case given the amount of memory and WAL. I'm suspecting > >that it's something else and I'd very much like to know if it's a > >general "this applies to all (most? a lot of?) SSDs because the > >hardware has a larger than 8KB page size and therefore the kernel has to > >read it", or if it's something odd about this particular system and > >doesn't apply generally. > > Not sure. I doubt it has anything to do with the hardware page size, > that's mostly transparent to the kernel anyway. But it might be that the > prefetching on a particular SSD has more overhead than what it saves. Right- I wouldn't have thought the hardware page size would matter either, but it's entirely possible that assumption is wrong and that it does matter for some reason- perhaps with just some SSDs, or maybe with a lot of them, or maybe there's something else entirely going on. 
About all I feel like I can say at the moment is that I'm very interested in ways to make WAL replay go faster and it'd be great to get more information about what's going on here to see if there's something we can do to generally improve WAL replay. Thanks, Stephen signature.asc Description: PGP signature
Re: [PATCH] Covering SPGiST index
> On 27 Aug 2020, at 21:03, Pavel Borisov wrote: > > see v8 For me the only concerning point is putting the nullmask and varatt bits into tuple->nextOffset. But, probably, we can go with this. But let's change the macro a bit. When I see SGLT_SET_OFFSET(leafTuple->nextOffset, InvalidOffsetNumber); I expect that leafTuple->nextOffset is a function argument passed by value and will not be changed. For example, see ItemPointerSetOffsetNumber() - it does not expose ip_posid. Also, instead of >*(leafChainDatums + i * natts) and leafChainIsnulls + i * natts I'd propose using something like >int some_index = i * natts; >leafChainDatums[some_index] and &leafChainIsnulls[some_index] But, probably, it's a matter of taste... Also, I'm not sure whether it would be helpful to use, instead of >isnull[0] and leafDatum[0] something more complex like >#define SpgKeyIndex 0 >isnull[SpgKeyIndex] and leafDatum[SpgKeyIndex] There are so many [0]s in the patch... Thanks! Best regards, Andrey Borodin.
[PATCH v1] explicit_bzero.c: using explicit_memset on NetBSD
Thanks. Regards. 0001-explicit_bzero.c-uses-explicit_memset-for-NetBSD.patch Description: Binary data
Re: Rare link canary failure in dblink test
Thomas Munro writes: > A couple of recent cases where an error "libpq is incorrectly linked > to backend functions" broke the dblink test: lorikeet seems just plain unstable these days :-(. Don't know why. > ... curiously the message also appeared a > couple of times on two Unixen. Perhaps we can write those off as lost > in the mists of time. No, those were expected, because they ran between ed0cdf0e0, which intentionally introduced this error, and e3d77ea6b and 4fa3741d1 which fixed it. regards, tom lane
Re: Disk-based hash aggregate's cost model
On Sun, Aug 30, 2020 at 02:26:20AM +0200, Tomas Vondra wrote: On Fri, Aug 28, 2020 at 06:32:38PM -0700, Jeff Davis wrote: On Thu, 2020-08-27 at 17:28 -0700, Peter Geoghegan wrote: We have a Postgres 13 open item for Disk-based hash aggregate, which is the only non-trivial open item. There is a general concern about how often we get disk-based hash aggregation when work_mem is particularly low, and recursion seems unavoidable. This is generally thought to be a problem in the costing. We discussed two approaches to tweaking the cost model: 1. Penalize HashAgg disk costs by a constant amount. It seems to be chosen a little too often, and we can reduce the number of plan changes. 2. Treat recursive HashAgg spilling skeptically, and heavily penalize recursive spilling. The problem with approach #2 is that we have a default hash mem of 4MB, and real systems have a lot more than that. In this scenario, recursive spilling can beat Sort by a lot.

I think the main issue is that we're mostly speculating what's wrong. I've shared some measurements and symptoms, and we've discussed what might be causing it, but I'm not really sure we know for sure. I really dislike (1) because it seems more like "We don't know what's wrong so we'll penalize hashagg," kind of solution. A much more principled solution would be to tweak the costing accordingly, not just by adding some constant. For (2) it really depends if recursive spilling is really the problem here. In the examples I shared, the number of partitions/batches was very different, but the query duration was mostly independent (almost constant).

I've decided to look at the costing a bit more closely today, and see why the costing is so much lower compared to sort/groupagg. I've used the same 32GB dataset and query as in [1]. I've repeated tests for all the work_mem values, and I see the number of batches are much lower, probably thanks to the HLL improvement:

    2MB    Planned:  64    Batches (old):  4160    Batches: 2977
    4MB    Planned: 128    Batches (old): 16512    Batches: 1569
    8MB    Planned: 256    Batches (old): 21488    Batches: 1289
   64MB    Planned:  32    Batches (old):  2720    Batches:  165
  256MB    Planned:   8    Batches (old):     8    Batches:   41

The impact on duration of the queries seems pretty negligible, though.
The plans with work_mem=64MB look like this: 1) hashagg QUERY PLAN Limit (cost=11267293.86..11267293.86 rows=1 width=36) (actual time=213127.515..213127.517 rows=0 loops=1) -> HashAggregate (cost=10229061.10..11267293.86 rows=6718533 width=36) (actual time=100862.623..212948.642 rows=640 loops=1) Group Key: lineitem.l_partkey Planned Partitions: 32 Batches: 165 Memory Usage: 67857kB Disk Usage: 6802432kB -> Seq Scan on lineitem (cost=0.00..5519288.36 rows=191990736 width=9) (actual time=0.418..20344.631 rows=192000551 loops=1) Planning Time: 0.064 ms Execution Time: 213441.986 ms (7 rows) 2) groupagg QUERY PLAN -- Limit (cost=36755617.81..36755617.81 rows=1 width=36) (actual time=180029.594..180029.595 rows=0 loops=1) -> GroupAggregate (cost=35214909.30..36755617.81 rows=6718533 width=36) (actual time=94017.788..179857.683 rows=640 loops=1) Group Key: lineitem.l_partkey -> Sort (cost=35214909.30..35694886.14 rows=191990736 width=9) (actual time=94017.750..151138.727 rows=192000551 loops=1) Sort Key: lineitem.l_partkey Sort Method: external merge Disk: 3742208kB -> Seq Scan on lineitem (cost=0.00..5519288.36 rows=191990736 width=9) (actual time=0.008..26831.264 rows=192000551 loops=1) Planning Time: 0.063 ms Execution Time: 180209.580 ms (9 rows) I still don't understand why the hashagg is costed so low compared to the sort (only about 1/3 of it), because both cases use exactly the same estimates internally. cost_tuplesort ends up with npages = 937455 nruns = 114.435396 input_bytes = 7679629440 log_runs = 1.0 while cost_agg uses pages_read = 937455 pages_written = 937455 relation_size = 7679629440; yet we end up with much lower estimate for hashagg. It however does seem to me this is mostly due to non-I/O costs, considered by cost_tuplesort and perhaps ignored by cost_agg? In particular, most of the sort cost comes from this *startup_cost = comparison_cost * tuples * LOG2(tuples); So I'm wondering if the hashagg is not ignoring similar non-I/O costs for the spilling case. In particular, the initial section computing startu
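A rough back-of-the-envelope check of that term, assuming the default cpu_operator_cost = 0.0025 (so comparison_cost = 2 * 0.0025 = 0.005):

    comparison_cost * tuples * LOG2(tuples)
      = 0.005 * 191990736 * log2(191990736)
      ≈ 0.005 * 191990736 * 27.5
      ≈ 26.4 million cost units

which is roughly the size of the gap between the ~35.2M sort estimate and the ~11.3M hashagg estimate above, so a missing per-tuple CPU term of this kind would be enough to explain the difference.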
Re: Row estimates for empty tables
Pavel Stehule writes: > I'll mark this patch as ready for commit Pushed, thanks for looking. regards, tom lane
Re: list of extended statistics on psql
On 2020-Aug-30, Tomas Vondra wrote: > On Sat, Aug 29, 2020 at 06:43:47PM -0400, Alvaro Herrera wrote: > > On 2020-Aug-29, Tomas Vondra wrote: > > > Also, it might be useful to show the size of the statistics built, just > > > like we show for \d+ etc. > > > > \dX+ I suppose? > > Right. I've only used \d+ as an example of an existing command showing > sizes of the objects. Yeah, I understood it that way too. How can you measure the size of the stat objects in a query? Are you thinking in pg_column_size()? I wonder how to report that. Knowing that psql \-commands are not meant for anything other than human consumption, maybe we can use a format() string that says "built: %d bytes" when \dX+ is used (for each stat type), and just "built" when \dX is used. What do people think about this? -- Álvaro Herrerahttps://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Re: list of extended statistics on psql
On Sun, Aug 30, 2020 at 12:33:29PM -0400, Alvaro Herrera wrote: On 2020-Aug-30, Tomas Vondra wrote: On Sat, Aug 29, 2020 at 06:43:47PM -0400, Alvaro Herrera wrote: > On 2020-Aug-29, Tomas Vondra wrote: > > Also, it might be useful to show the size of the statistics built, just > > like we show for \d+ etc. > > \dX+ I suppose? Right. I've only used \d+ as an example of an existing command showing sizes of the objects. Yeah, I understood it that way too. How can you measure the size of the stat objects in a query? Are you thinking in pg_column_size()? Either that or simply length() on the bytea value. I wonder how to report that. Knowing that psql \-commands are not meant for anything other than human consumption, maybe we can use a format() string that says "built: %d bytes" when \dX+ is used (for each stat type), and just "built" when \dX is used. What do people think about this? I'd use the same approach as \d+, i.e. a separate column with the size. Maybe that'd mean too many columns, though. regards -- Tomas Vondra http://www.2ndQuadrant.com PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Re: Deprecating postfix and factorial operators in PostgreSQL 13
> On Aug 28, 2020, at 8:56 AM, Tom Lane wrote: > > Robert Haas writes: >> So, in this version, there are six copies of the deprecation notice >> John wrote, rather than just one. Maybe we need more than one, but I >> doubt we need six. I don't think the CREATE OPERATOR documentation >> needs to mention this both when first introducing the concept and then >> again when defining right_type; the former seems sufficient. I don't >> think xoper.sgml needs these changes either; they interrupt the flow >> of the narrative and I don't think this is where anyone would look for >> a deprecation notice. I would also argue for dropping the mentions in >> the docs for ALTER OPERATOR FAMILY and CREATE OPERATOR CLASS, although >> those seem less clear-cut. Really, CREATE OPERATOR where John had it >> originally seems like the right place. > > Yeah, I agree that there are way too many copies here. CREATE OPERATOR > seems sufficient. It also seems like we should just rewrite the typeconv > and drop_operator examples to use some other operator. We'll have > to do that eventually anyway, so why not now, instead of visiting those > places twice? John's deprecation language now only appears in the create operator documentation. The first typeconv example was based on the (int8 !) operator. I chose to replace it with and example based on the (jsonb ? text) operator, as the two operators have a relevant similarity. Both of them are singletons, with only one interpretation in the standard catalogs. In both cases, the scanner cannot know the types of the undecorated arguments and assigns default types (integer and unknown, respectively), which get resolved later to match the only types accepted by the operator ('int8' for !, and 'jsonb,text' for ?). The drop operator example has been left, though altered to include the adjective "deprecated". Robert mentions that the entire example could simply be dropped now, and I agree with that, but it also makes sense to me to drop the example in 14, when the operation it describes is also dropped. If the committer who picks this up wants to drop that example now, that's fine, too. v3-0001-Adding-deprecation-notices.patch Description: Binary data — Mark Dilger EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: list of extended statistics on psql
Tomas Vondra writes: > On Sun, Aug 30, 2020 at 12:33:29PM -0400, Alvaro Herrera wrote: >> I wonder how to report that. Knowing that psql \-commands are not meant >> for anything other than human consumption, maybe we can use a format() >> string that says "built: %d bytes" when \dX+ is used (for each stat type), >> and just "built" when \dX is used. What do people think about this? Seems a little too cute to me. > I'd use the same approach as \d+, i.e. a separate column with the size. > Maybe that'd mean too many columns, though. psql already has \d commands with so many columns that you pretty much have to use \x mode to make them legible; \df+ for instance. I don't mind if \dX+ is also in that territory. It'd be good though if plain \dX can fit in a normal terminal window. regards, tom lane
Re: Row estimates for empty tables
On Sun, Aug 30, 2020 at 6:23 PM Tom Lane wrote: > Pavel Stehule writes: > > I'll mark this patch as ready for commit > > Pushed, thanks for looking. > Thank you Pavel > > regards, tom lane >
Re: Deprecating postfix and factorial operators in PostgreSQL 13
Mark Dilger writes: > [ v3-0001-Adding-deprecation-notices.patch ] Pushed with some fiddling. We previously found that adding id tags to constructs in the function lists didn't work in PDF output [1]. Your patch did build a PDF without warnings for me, which is odd --- apparently we changed something since April? But the links didn't lead to quite the right place, so the conclusion that there's something broken with that still holds. I made it look like the existing usages for encode() and decode(). I did not like your choice of "jsonb ? text" for the typeconv example at all, primarily because it introduces a host of distractions for anyone who's not already quite familiar with both JSON and that operator. After some digging around I found the |/ (square root) operator, which likewise has just one instance, and it's something familiar enough that most readers probably won't be slowed down by trying to figure out what it does. Also, this more nearly covers the territory of the original example, namely auto-casting an integer constant to something else. There are later examples covering unknown-type literals, so we don't need to address that scenario in this one. I also found another example that depends on the ! operator, in the operator precedence discussion. I concluded after study that the particular case it's on about only arises for postfix operators, so I just got rid of that example in favor of a generalized mention that you should use parentheses to override operator precedence. I concluded that there's not much point in touching drop_operator.sgml at all yet. I don't think labeling ! as deprecated as adding anything to that example, and besides there's another example that will also need to be changed. regards, tom lane [1] https://www.postgresql.org/message-id/32159.1586738304%40sss.pgh.pa.us
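For reference, the substituted typeconv example boils down to the implicit cast of an integer constant to double precision, e.g.:

    SELECT |/ 25 AS square_root;
    -- |/ only takes double precision, so the integer constant 25 is
    -- implicitly cast, and the result is 5 (double precision)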
Re: Compatible defaults for LEAD/LAG
Pavel Stehule writes: > This is nice example of usage of anycompatible type (that is consistent > with other things in Postgres), but standard says something else. > It can be easily solved with https://commitfest.postgresql.org/28/2081/ - > but Tom doesn't like this patch. > I am more inclined to think so this feature should be implemented > differently - there is no strong reason to go against standard or against > the implementations of other databases (and increase the costs of porting). > Now the implementation is limited, but allowed behaviour is 100% ANSI. I don't particularly buy this argument. The case at hand is what to do if we have, say, select lag(integer_column, 1, 1.2) over ... The proposed patch would result in the output being of type numeric, and any rows using the default would show "1.2". The spec says that the right thing is to return integer, and we should round the default to "1" to make that work. But (1) I doubt that anybody actually writes such things; (2) For anyone who does write it, the spec's behavior fails to meet the principle of least surprise. It is not normally the case that any information-losing cast would be applied silently within an expression. So this deviation from spec doesn't bother me; we have much bigger ones. My concern with this patch is what I said upthread: I'm not sure that we should be adjusting this behavior in a piecemeal fashion. I looked through pg_proc to find all the functions that have more than one any* argument, and found these: oid | prorettype ---+ lag(anyelement,integer,anyelement)| anyelement lead(anyelement,integer,anyelement) | anyelement width_bucket(anyelement,anyarray) | integer btarraycmp(anyarray,anyarray) | integer array_eq(anyarray,anyarray) | boolean array_ne(anyarray,anyarray) | boolean array_lt(anyarray,anyarray) | boolean array_gt(anyarray,anyarray) | boolean array_le(anyarray,anyarray) | boolean array_ge(anyarray,anyarray) | boolean array_append(anyarray,anyelement) | anyarray array_prepend(anyelement,anyarray)| anyarray array_cat(anyarray,anyarray) | anyarray array_larger(anyarray,anyarray) | anyarray array_smaller(anyarray,anyarray) | anyarray array_position(anyarray,anyelement) | integer array_position(anyarray,anyelement,integer) | integer array_positions(anyarray,anyelement) | integer[] array_remove(anyarray,anyelement) | anyarray array_replace(anyarray,anyelement,anyelement) | anyarray arrayoverlap(anyarray,anyarray) | boolean arraycontains(anyarray,anyarray) | boolean arraycontained(anyarray,anyarray) | boolean elem_contained_by_range(anyelement,anyrange) | boolean range_contains_elem(anyrange,anyelement) | boolean range_eq(anyrange,anyrange) | boolean range_ne(anyrange,anyrange) | boolean range_overlaps(anyrange,anyrange) | boolean range_contains(anyrange,anyrange) | boolean range_contained_by(anyrange,anyrange) | boolean range_adjacent(anyrange,anyrange) | boolean range_before(anyrange,anyrange) | boolean range_after(anyrange,anyrange)| boolean range_overleft(anyrange,anyrange) | boolean range_overright(anyrange,anyrange)| boolean range_union(anyrange,anyrange)| anyrange range_merge(anyrange,anyrange)| anyrange range_intersect(anyrange,anyrange)| anyrange range_minus(anyrange,anyrange)| anyrange range_cmp(anyrange,anyrange) | integer range_lt(anyrange,anyrange) | boolean range_le(anyrange,anyrange) | boolean range_ge(anyrange,anyrange) | boolean range_gt(anyrange,anyrange) | boolean range_gist_same(anyrange,anyrange,internal) | internal (45 rows) Now, there's no point in changing range_eq and the later 
entries in this table (the ones taking two anyrange's); our rather lame definition of anycompatiblerange means that we'd get no benefit from doing so. But I think there's a strong case for changing everything before range_eq. It'd be nice if something like SELECT array[1] < array[1.1]; would work the same way that "SELECT 1 < 1.1" would. I checked the other concern that I had about whether the more flexible polymorphic definitions would create any new ambiguities, and I don't see any problems with this list. As functions, none of these names are overloaded, except with different numbers of arguments so there's no ambiguity. Most of the array functions are also operators, bu
Re: New default role- 'pg_read_all_data'
Greetings, * Stephen Frost (sfr...@snowman.net) wrote: > * Magnus Hagander (mag...@hagander.net) wrote: > > On Fri, Aug 28, 2020 at 2:38 PM Stephen Frost wrote: > > > * Magnus Hagander (mag...@hagander.net) wrote: > > > > Without having actually looked at the code, definite +1 for this > > > > feature. > > > > It's much requested... > > > > > > Thanks. > > > > > > > But, should we also have a pg_write_all_data to go along with it? > > > > > > Perhaps, but could certainly be a different patch, and it'd need to be > > > better defined, it seems to me... read_all is pretty straight-forward > > > (the general goal being "make pg_dumpall/pg_dump work"), what would > > > write mean? INSERT? DELETE? TRUNCATE? ALTER TABLE? System catalogs? > > > > Well, it's pg_write_all_*data*, so it certainly wouldn't be alter table or > > system catalogs. > > > > I'd say insert/update/delete yes. > > > > TRUNCATE is always an outlier.Given it's generally classified as DDL, I > > wouldn't include it. > > Alright, that seems like it'd be pretty easy. We already have a check > in pg_class_aclmask to disallow modification of system catalogs w/o > being a superuser, so we should be alright to add a similar check for > insert/update/delete just below where I added the SELECT check. > > > > Doesn't seem like you could just declare it to be 'allow pg_restore' > > > either, as that might include creating untrusted functions, et al. > > > > No definitely not. That wouldn't be the usecase at all :) > > Good. :) > > > (and fwiw to me the main use case for read_all_data also isn't pg_dump, > > because most people using pg_dump are already db owner or higher in my > > experience. But it is nice that it helps with that too) > > Glad to have confirmation that there's other use-cases this helps with. > > I'll post an updated patch with that added in a day or two. Here's that updated patch, comments welcome. Thanks! Stephen diff --git a/doc/src/sgml/user-manag.sgml b/doc/src/sgml/user-manag.sgml index 829decd883..1213125bfd 100644 --- a/doc/src/sgml/user-manag.sgml +++ b/doc/src/sgml/user-manag.sgml @@ -517,6 +517,20 @@ DROP ROLE doomed_role; + + pg_read_all_data + Read all data (tables, views, sequences), as if having SELECT + rights on those objects, and USAGE rights on all schemas, even without + having it explicitly. This role also has BYPASSRLS + set for it. + + + pg_write_all_data + Write all data (tables, views, sequences), as if having INSERT, + UPDATE, and DELETE rights on those objects, and USAGE rights on all + schemas, even without having it explicitly. This role also has + BYPASSRLS set for it. + pg_read_all_settings Read all configuration variables, even those normally visible only to diff --git a/src/backend/catalog/aclchk.c b/src/backend/catalog/aclchk.c index c626161408..3e6d060554 100644 --- a/src/backend/catalog/aclchk.c +++ b/src/backend/catalog/aclchk.c @@ -3849,6 +3849,26 @@ pg_class_aclmask(Oid table_oid, Oid roleid, ReleaseSysCache(tuple); + /* + * Check if ACL_SELECT is being checked and, if so, and not set already + * as part of the result, then check if the user is a member of the + * pg_read_all_data role, which allows read access to all relations. 
+ */ + if (mask & ACL_SELECT && !(result & ACL_SELECT) && + has_privs_of_role(roleid, DEFAULT_ROLE_READ_ALL_DATA)) + result |= ACL_SELECT; + + /* + * Check if ACL_INSERT, ACL_UPDATE, or ACL_DELETE is being checked + * and, if so, and not set already as part of the result, then check + * if the user is a member of the pg_write_all_data role, which + * allows INSERT/UPDATE/DELETE access to all relations. + */ + if (mask & (ACL_INSERT | ACL_UPDATE | ACL_DELETE) && + !(result & (ACL_INSERT | ACL_UPDATE | ACL_DELETE)) && + has_privs_of_role(roleid, DEFAULT_ROLE_WRITE_ALL_DATA)) + result |= (mask & (ACL_INSERT | ACL_UPDATE | ACL_DELETE)); + return result; } @@ -4175,6 +4195,16 @@ pg_namespace_aclmask(Oid nsp_oid, Oid roleid, ReleaseSysCache(tuple); + /* + * Check if ACL_USAGE is being checked and, if so, and not set already + * as part of the result, then check if the user is a member of the + * pg_read_all_data or pg_write_all_data roles, which allow usage + * access to all schemas. + */ + if (mask & ACL_USAGE && !(result & ACL_USAGE) && + (has_privs_of_role(roleid, DEFAULT_ROLE_READ_ALL_DATA) || + has_privs_of_role(roleid, DEFAULT_ROLE_WRITE_ALL_DATA))) + result |= ACL_USAGE; return result; } diff --git a/src/include/catalog/pg_authid.dat b/src/include/catalog/pg_authid.dat index 7c08851550..3f72474146 100644 --- a/src/include/catalog/pg_authid.dat +++ b/src/include/catalog/pg_authid.dat @@ -25,6 +25,16 @@ rolcreaterole => 'f', rolcreatedb => 'f', rolcanlogin => 'f', rolreplication => 'f', rolbypassrls => 'f', rolconnlimit => '-1', rolpassword => '_null_', rolvaliduntil =
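As a usage sketch of what the patch enables (the role name "reporting" is just an example, not from the patch):

    CREATE ROLE reporting LOGIN;
    GRANT pg_read_all_data TO reporting;
    -- and, with the write-side role added in this version:
    GRANT pg_write_all_data TO reporting;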
Re: list of extended statistics on psql
Hi Alvaro, IMO the per-type columns should show both the type being enabled as well as it being built. Hmm. I'm not sure how to get the status (enabled or disabled) of extended stats. :( Could you explain it more? pg_statistic_ext_data.stxdndistinct is not null if the stats have been built. (I'm not sure whether there's an easier way to determine this.) Ah.. I see! Thank you. I suggest to do this Name| Schema | Definition | Ndistinct | Dependencies | MCV ---++--+---+--+- stts_1| public | (a, b) FROM t1 | f | t| f I suppose that the current column order is sufficient if there is no improvement of extended stats on PG14. Do you know any plan to improve extended stats such as to allow it to cross multiple tables on PG14? I suggest that changing it in the future is going to be an uphill battle, so better get it right from the get go, without requiring a future restructure. I understand your suggestions. I'll replace "Columns" and "Table" columns with "Definition" column. Currently, I use this query to get Extended stats info from pg_statistic_ext. Maybe something like this would do SELECT stxnamespace::pg_catalog.regnamespace AS "Schema", stxname AS "Name", format('%s FROM %s', (SELECT pg_catalog.string_agg(pg_catalog.quote_ident(attname),', ') FROM pg_catalog.unnest(stxkeys) s(attnum) JOIN pg_catalog.pg_attribute a ON (stxrelid = a.attrelid AND a.attnum = s.attnum AND NOT attisdropped)), stxrelid::regclass) AS "Definition", CASE WHEN stxdndistinct IS NOT NULL THEN 'built' WHEN 'd' = any(stxkind) THEN 'enabled, not built' END AS "n-distinct", CASE WHEN stxddependencies IS NOT NULL THEN 'built' WHEN 'f' = any(stxkind) THEN 'enabled, not built' END AS "functional dependencies", CASE WHEN stxdmcv IS NOT NULL THEN 'built' WHEN 'm' = any(stxkind) THEN 'enabled, not built' END AS mcv FROM pg_catalog.pg_statistic_ext es INNER JOIN pg_catalog.pg_class c ON stxrelid = c.oid LEFT JOIN pg_catalog.pg_statistic_ext_data esd ON es.oid = esd.stxoid ORDER BY 1, 2, 3; Great! It helped me a lot to understand your suggestions correctly. Thanks. :-D I got the below results by your query. create table t1 (a int, b int); create statistics stts_1 (dependencies) on a, b from t1; create statistics stts_2 (dependencies, ndistinct) on a, b from t1; create statistics stts_3 (dependencies, ndistinct, mcv) on a, b from t1; create table t2 (a int, b int, c int); create statistics stts_4 on b, c from t2; create table hoge (col1 int, col2 int, col3 int); create statistics stts_hoge on col1, col2, col3 from hoge; insert into t1 select i,i from generate_series(1,100) i; analyze t1; Your query gave this result: Schema | Name| Definition | n-distinct | functional dependencies |mcv +---+++-+ public | stts_1| a, b FROM t1 || built | public | stts_2| a, b FROM t1 | built | built | public | stts_3| a, b FROM t1 | built | built | built public | stts_4| b, c FROM t2 | enabled, not built | enabled, not built | enabled, not built public | stts_hoge | col1, col2, col3 FROM hoge | enabled, not built | enabled, not built | enabled, not built (5 rows) I guess "enabled, not built" is a little redundant. The status would better to have three patterns: "built", "not built" or nothing (NULL) like these: - "built": extended stats is defined and built (collected by analyze cmd) - "not built": extended stats is defined but have not built yet - nothing (NULL): extended stats is not defined What do you think about it? 
I will send a new patch including: - Replace the "Columns" and "Table" columns with a "Definition" column - Show the status (built/not built/null) of extended stats by using pg_statistic_ext_data Thanks, Tatsuro Yamada
Re: Get rid of runtime handling of AlternativeSubPlan?
On Sun, Aug 30, 2020 at 7:26 AM Tom Lane wrote: > I wrote: > > Back in bd3daddaf232d95b0c9ba6f99b0170a0147dd8af, which introduced > > AlternativeSubPlans, I wrote: > > There is a lot more that could be done based on this infrastructure: in > > particular it's interesting to consider switching to the hash plan if > we start > > out using the non-hashed plan but find a lot more upper rows going by > than we > > expected. I have therefore left some minor inefficiencies in place, > such as > > initializing both subplans even though we will currently only use one. > > > > That commit will be twelve years old come August, and nobody has either > > built anything else atop it or shown any interest in making the plan > choice > > switchable mid-run. So it seems like kind of a failed experiment. > > > > Therefore, I'm considering the idea of ripping out all executor support > > for AlternativeSubPlan and instead having the planner replace an > > AlternativeSubPlan with the desired specific SubPlan somewhere late in > > planning, possibly setrefs.c. > > Here's a proposed patchset for that. This runs with the idea I'd had > that setrefs.c could be smarter than the executor about which plan node > subexpressions will be executed how many times. I did not take it very > far, for fear of adding an undue number of planning cycles, but it's still > better than what we have now. > > For ease of review, 0001 adds the new planner logic, while 0002 removes > the now-dead executor support. > > There's one bit of dead code that I left in place for the moment, which is > ruleutils.c's support for printing AlternativeSubPlans. I'm not sure if > that's worth keeping or not --- it's dead code for normal use, but if > someone tried to use ruleutils.c to print partially-planned expression > trees, maybe there'd be a use for it? > > (It's also arguable that readfuncs.c's support is now dead code, but > I have little interest in stripping that out.) > > regards, tom lane > > Thank you for this code! I still have some confusion about when a SubPlan should be executed when a join is involved. I care about this because this has an impact on when we can get the num_exec for a subplan. The subplan in a target list, it is executed after the join in my case. The subplan can be execute after the scan of T1(see below example) and it can also be executed after the join. Which one is better depends on which methods make the num_exec smaller. Is it something we already considered? I drill-down to populate_joinrel_with_paths and not find this logic. # explain (costs off) select (select a from t2 where t2.b = t1.b) from t1, t3; QUERY PLAN -- Nested Loop -> Seq Scan on t1 -> Materialize -> Seq Scan on t3 SubPlan 1 -> Seq Scan on t2 Filter: (b = t1.b) (7 rows) When the subplan is in a Qual, it is supposed to be executed as soon as possible, The current implementation matches the below cases. So can we say we knows the num_execs of SubPlan just after we plan the dependent rels? (In Q1 below the dependent rel is t1 vs t3, in Q2 it is t1 only) If we can choose a subplan and recost the related path during(not after) creating the best path, will we get better results for some cases (due to the current cost method for AlternativeSubPlan[1])? 
-- the subplan depends on the result of t1 join t3 # explain (costs off) select t1.* from t1, t3 where t1.a > (select max(a) from t2 where t2.b = t1.b and t2.c = t3.c); QUERY PLAN - Nested Loop Join Filter: (t1.a > (SubPlan 1)) -> Seq Scan on t1 -> Materialize -> Seq Scan on t3 SubPlan 1 -> Aggregate -> Seq Scan on t2 Filter: ((b = t1.b) AND (c = t3.c)) (9 rows) -- the subplan only depends on t1. # explain (costs off) select t1.* from t1, t3 where t1.a > (select max(a) from t2 where t2.b = t1.b); QUERY PLAN Nested Loop -> Seq Scan on t3 -> Materialize -> Seq Scan on t1 Filter: (a > (SubPlan 1)) SubPlan 1 -> Aggregate -> Seq Scan on t2 Filter: (b = t1.b) (9 rows) At last, I want to use the commonly used table in src/test/regress/sql/create_table.sql when providing an example, but I always have issues running the create_table.sql which makes me uncomfortable to use that. Am I missing something? CREATE TABLE fail_part PARTITION OF range_parted3 FOR VALUES FROM (1, minvalue) TO (1, maxvalue); psql:src/test/regress/sql/create_table.sql:611: ERROR: partition "fail_part" would overlap partition "part10" CREATE TABLE fail_part PARTITION OF hash_parted2 FOR VALUES WITH (MODULUS 2, REMAINDER 1); psql:src/test/regress/sql/create_table.sql:622: ERROR: partition "fail_part" woul
Re: Terminate the idle sessions
On Tue, Aug 18, 2020 at 2:13 PM Li Japin wrote: > On Aug 18, 2020, at 9:19 AM, Kyotaro Horiguchi > wrote: > The same already happens for idle_in_transaction_session_timeout and > we can use "ALTER ROLE/DATABASE SET" to dislable or loosen them, it's > a bit cumbersome, though. I don't think we should (at least > implicitly) disable those timeouts ad-hockerly for postgres_fdw. > > +1.

This seems like a reasonable feature to me. The delivery of the error message explaining what happened is probably not reliable, so to some clients and on some operating systems this will be indistinguishable from a dropped network connection or other error, but that's OK and we already have that problem with the existing timeout-based disconnection feature.

The main problem I have with it is the high frequency of setitimer() calls. If you enable both statement_timeout and idle_session_timeout, then we get up to a huge number of system calls, as the following strace -c output for a few seconds of one backend under a pgbench -S workload shows:

 % time     seconds  usecs/call     calls    errors syscall
------ ----------- ----------- --------- --------- ----------------
  39.45    0.118685           0    250523           setitimer
  29.98    0.090200           0    125275           sendto
  24.30    0.073107           0    126235       973 recvfrom
   6.01    0.018068           0     20950           pread64
   0.26    0.000779           0       973           epoll_wait
------ ----------- ----------- --------- --------- ----------------
 100.00    0.300839                523956       973 total

There's a small but measurable performance drop from this, as also discussed in another thread about another kind of timeout[1]. Maybe we should try to fix that with something like the attached?

[1] https://www.postgresql.org/message-id/flat/77def86b27e41f0efcba411460e929ae%40postgrespro.ru

0001-Optimize-setitimer-usage.patchx Description: Binary data
Re: list of extended statistics on psql
On 2020/08/31 1:59, Tom Lane wrote: Tomas Vondra writes: On Sun, Aug 30, 2020 at 12:33:29PM -0400, Alvaro Herrera wrote: I wonder how to report that. Knowing that psql \-commands are not meant for anything other than human consumption, maybe we can use a format() string that says "built: %d bytes" when \dX+ is used (for each stat type), and just "built" when \dX is used. What do people think about this? Seems a little too cute to me. I'd use the same approach as \d+, i.e. a separate column with the size. Maybe that'd mean too many columns, though. psql already has \d commands with so many columns that you pretty much have to use \x mode to make them legible; \df+ for instance. I don't mind if \dX+ is also in that territory. It'd be good though if plain \dX can fit in a normal terminal window. Hmm. How about these instead of "built: %d bytes"? I added three columns (N_size, D_size, M_size) to show size. See below: === postgres=# \dX List of extended statistics Schema | Name| Definition | N_distinct | Dependencies | Mcv +---+++--+--- public | stts_1| a, b FROM t1 || built| public | stts_2| a, b FROM t1 | built | built| public | stts_3| a, b FROM t1 | built | built| built public | stts_4| b, c FROM t2 | not built | not built| not built public | stts_hoge | col1, col2, col3 FROM hoge | not built | not built| not built (5 rows) postgres=# \dX+ List of extended statistics Schema | Name| Definition | N_distinct | Dependencies | Mcv| N_size | D_size | M_size +---+++--+---+++ public | stts_1| a, b FROM t1 || built| || 40 | public | stts_2| a, b FROM t1 | built | built| | 13 | 40 | public | stts_3| a, b FROM t1 | built | built| built | 13 | 40 | 6126 public | stts_4| b, c FROM t2 | not built | not built| not built ||| public | stts_hoge | col1, col2, col3 FROM hoge | not built | not built| not built ||| === I used this query to get results of "\dX+". === SELECT stxnamespace::pg_catalog.regnamespace AS "Schema", stxname AS "Name", format('%s FROM %s', (SELECT pg_catalog.string_agg(pg_catalog.quote_ident(attname),', ') FROM pg_catalog.unnest(stxkeys) s(attnum) JOIN pg_catalog.pg_attribute a ON (stxrelid = a.attrelid AND a.attnum = s.attnum AND NOT attisdropped)), stxrelid::regclass) AS "Definition", CASE WHEN esd.stxdndistinct IS NOT NULL THEN 'built' WHEN 'd' = any(stxkind) THEN 'not built' END AS "N_distinct", CASE WHEN esd.stxddependencies IS NOT NULL THEN 'built' WHEN 'f' = any(stxkind) THEN 'not built' END AS "Dependencies", CASE WHEN esd.stxdmcv IS NOT NULL THEN 'built' WHEN 'm' = any(stxkind) THEN 'not built' END AS "Mcv", pg_catalog.length(stxdndistinct) AS "N_size", pg_catalog.length(stxddependencies) AS "D_size", pg_catalog.length(stxdmcv) AS "M_size" FROM pg_catalog.pg_statistic_ext es INNER JOIN pg_catalog.pg_class c ON stxrelid = c.oid LEFT JOIN pg_catalog.pg_statistic_ext_data esd ON es.oid = esd.stxoid ORDER BY 1, 2; === Attached patch includes: - Replace "Columns" and "Table" column with "Definition" - Show the status (built/not built/null) of extended stats by using pg_statistic_ext_data - Add "\dX+" command to show size of extended stats Please find the attached file! :-D Thanks, Tatsuro Yamada diff --git a/doc/src/sgml/ref/psql-ref.sgml b/doc/src/sgml/ref/psql-ref.sgml index fc16e6c..5664c22 100644 --- a/doc/src/sgml/ref/psql-ref.sgml +++ b/doc/src/sgml/ref/psql-ref.sgml @@ -1893,6 +1893,20 @@ testdb=> + + +\dX [ pattern ] + + +Lists extended statistics. +If pattern +is specified, only those extended statistics whose names match the pattern +are listed. 
+If + is appended to the command name, each extended statistics +is listed with its size. + + + \dy[+] [ pattern ] diff --git a/src/bin/psql/command.c b/src/bin/psql/command.c index 9902a4a..077a585 100644 --- a/src/bin/psql/command.c +++ b/src/bin/psql/command.c @@
Re: Get rid of runtime handling of AlternativeSubPlan?
On Sun, 30 Aug 2020 at 11:26, Tom Lane wrote: > > I wrote: > > Therefore, I'm considering the idea of ripping out all executor support > > for AlternativeSubPlan and instead having the planner replace an > > AlternativeSubPlan with the desired specific SubPlan somewhere late in > > planning, possibly setrefs.c. > > Here's a proposed patchset for that. Do you feel that the choice to create_plan() on the subplan before planning the outer query is still a good one? ISTM that that was required when the AlternativeSubPlan decision was made during execution, since we, of course, need a plan to execute. If the decision is now being made in the planner then is it not better to delay the create_plan() until later in planning? From looking at the code it seems that Paths won't really do here as we're dealing with two separate PlannerInfos rather than two paths belonging to the same PlannerInfo, but maybe it's better to invent something else that's similar to a list of paths and just do create_plan() for the subquery once. David
Re: Terminate the idle sessions
On Aug 31, 2020, at 8:51 AM, Thomas Munro <thomas.mu...@gmail.com> wrote: The main problem I have with it is the high frequency of setitimer() calls. If you enable both statement_timeout and idle_session_timeout, then we get up to a huge number of system calls, as the following strace -c output for a few seconds of one backend under a pgbench -S workload shows:

 % time     seconds  usecs/call     calls    errors syscall
------ ----------- ----------- --------- --------- ----------------
  39.45    0.118685           0    250523           setitimer
  29.98    0.090200           0    125275           sendto
  24.30    0.073107           0    126235       973 recvfrom
   6.01    0.018068           0     20950           pread64
   0.26    0.000779           0       973           epoll_wait
------ ----------- ----------- --------- --------- ----------------
 100.00    0.300839                523956       973 total

Hi Thomas, Could you give more details about the test instructions?
Re: [PATCH v1] explicit_bzero.c: using explicit_memset on NetBSD
On Sun, Aug 30, 2020 at 02:03:32PM +0100, David CARLIER wrote: > Thanks. During the addition of explicit_bzero(), there was an agreement to use memset_s(), as it is blessed by the standard: https://www.postgresql.org/message-id/20190717211931.GA906@alvherre.pgsql So what would be the advantage of explicit_memset() knowing that NetBSD has memset_s()? This also means that your patch is a no-op for NetBSD as HAVE_MEMSET_S would be set. -- Michael signature.asc Description: PGP signature
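For context, the proposed change presumably adds a branch along these lines to src/port/explicit_bzero.c (a sketch; the exact configure symbol is an assumption), which indeed would only be reached if memset_s() were not detected first:

    #elif defined(HAVE_EXPLICIT_MEMSET)    /* e.g. NetBSD */

    void
    explicit_bzero(void *buf, size_t len)
    {
        (void) explicit_memset(buf, 0, len);
    }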
Re: Terminate the idle sessions
On Mon, Aug 31, 2020 at 2:40 PM Li Japin wrote: > Could you give the more details about the test instructions? Hi Japin, Sure. Because I wasn't trying to get reliable TPS number or anything, I just used a simple short read-only test with one connection, like this: pgbench -i -s10 postgres pgbench -T60 -Mprepared -S postgres Then I looked for the active backend and ran strace -c -p XXX for a few seconds and hit ^C to get the counters. I doubt the times are very accurate, but the number of calls is informative. If you do that on a server running with -c statement_timeout=10s, you see one setitimer() per transaction. If you also use -c idle_session_timeout=10s at the same time, you see two.
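Spelled out as commands (scale and duration are arbitrary; XXX is the PID of the active backend):

    pgbench -i -s10 postgres
    pgbench -T60 -Mprepared -S postgres &
    strace -c -p XXX    # let it run a few seconds, then ^C for the counters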
Re: Terminate the idle sessions
At Mon, 31 Aug 2020 12:51:20 +1200, Thomas Munro wrote in > On Tue, Aug 18, 2020 at 2:13 PM Li Japin wrote: > > On Aug 18, 2020, at 9:19 AM, Kyotaro Horiguchi > > wrote: > > The same already happens for idle_in_transaction_session_timeout and > > we can use "ALTER ROLE/DATABASE SET" to dislable or loosen them, it's > > a bit cumbersome, though. I don't think we should (at least > > implicitly) disable those timeouts ad-hockerly for postgres_fdw. > > > > +1. > > This seems like a reasonable feature to me. > > The delivery of the error message explaining what happened is probably > not reliable, so to some clients and on some operating systems this > will be indistinguishable from a dropped network connection or other > error, but that's OK and we already have that problem with the > existing timeout-based disconnection feature. > > The main problem I have with it is the high frequency setitimer() > calls. If you enable both statement_timeout and idle_session_timeout, > then we get up to huge number of system calls, like the following > strace -c output for a few seconds of one backend under pgbench -S > workload shows: > > % time seconds usecs/call callserrors syscall > -- --- --- - - > 39.450.118685 0250523 setitimer > 29.980.090200 0125275 sendto > 24.300.073107 0126235 973 recvfrom > 6.010.018068 0 20950 pread64 > 0.260.000779 0 973 epoll_wait > -- --- --- - - > 100.000.300839523956 973 total > > There's a small but measurable performance drop from this, as also > discussed in another thread about another kind of timeout[1]. Maybe > we should try to fix that with something like the attached? > > [1] > https://www.postgresql.org/message-id/flat/77def86b27e41f0efcba411460e929ae%40postgrespro.ru I think it's worth doing. Maybe we can get rid of doing anything other than removing an entry in the case where we disable a non-nearest timeout. regards. -- Kyotaro Horiguchi NTT Open Source Software Center
v13: show extended stats target in \d
The stats target can be set since commit d06215d03, but wasn't shown by psql. ALTER STATISISTICS .. SET STATISTICS n. Normal (1-D) stats targets are shown in \d+ table. Stats objects are shown in \d (no plus). Arguably, this should be shown only in "verbose" mode (\d+). >From 60a4033a04fbaafbb01c2bb2d73bb2fbe4d1da75 Mon Sep 17 00:00:00 2001 From: Justin Pryzby Date: Sun, 30 Aug 2020 23:38:17 -0500 Subject: [PATCH 1/1] Show stats target of extended statistics This can be set since d06215d03, but wasn't shown by psql. Normal (1-D) stats targets are shown in \d+ table. --- src/bin/psql/describe.c | 8 +++- src/test/regress/expected/stats_ext.out | 18 ++ src/test/regress/sql/stats_ext.sql | 2 ++ 3 files changed, 27 insertions(+), 1 deletion(-) diff --git a/src/bin/psql/describe.c b/src/bin/psql/describe.c index d81f1575bf..73befa36ee 100644 --- a/src/bin/psql/describe.c +++ b/src/bin/psql/describe.c @@ -2683,10 +2683,12 @@ describeOneTableDetails(const char *schemaname, "a.attnum = s.attnum AND NOT attisdropped)) AS columns,\n" " 'd' = any(stxkind) AS ndist_enabled,\n" " 'f' = any(stxkind) AS deps_enabled,\n" - " 'm' = any(stxkind) AS mcv_enabled\n" + " 'm' = any(stxkind) AS mcv_enabled,\n" + " %s" "FROM pg_catalog.pg_statistic_ext stat " "WHERE stxrelid = '%s'\n" "ORDER BY 1;", + (pset.sversion >= 13 ? "stxstattarget\n" : "-1\n"), oid); result = PSQLexec(buf.data); @@ -2732,6 +2734,10 @@ describeOneTableDetails(const char *schemaname, PQgetvalue(result, i, 4), PQgetvalue(result, i, 1)); + if (strcmp(PQgetvalue(result, i, 8), "-1") != 0) + appendPQExpBuffer(&buf, " STATISTICS %s", + PQgetvalue(result, i, 8)); + printTableAddFooter(&cont, buf.data); } } diff --git a/src/test/regress/expected/stats_ext.out b/src/test/regress/expected/stats_ext.out index 0ae779a3b9..4dea48a2e8 100644 --- a/src/test/regress/expected/stats_ext.out +++ b/src/test/regress/expected/stats_ext.out @@ -102,6 +102,15 @@ WARNING: statistics object "public.ab1_a_b_stats" could not be computed for rel ALTER TABLE ab1 ALTER a SET STATISTICS -1; -- setting statistics target 0 skips the statistics, without printing any message, so check catalog ALTER STATISTICS ab1_a_b_stats SET STATISTICS 0; +\d+ ab1 +Table "public.ab1" + Column | Type | Collation | Nullable | Default | Storage | Stats target | Description ++-+---+--+-+-+--+- + a | integer | | | | plain | | + b | integer | | | | plain | | +Statistics objects: +"public"."ab1_a_b_stats" (ndistinct, dependencies, mcv) ON a, b FROM ab1 STATISTICS 0 + ANALYZE ab1; SELECT stxname, stxdndistinct, stxddependencies, stxdmcv FROM pg_statistic_ext s, pg_statistic_ext_data d @@ -113,6 +122,15 @@ SELECT stxname, stxdndistinct, stxddependencies, stxdmcv (1 row) ALTER STATISTICS ab1_a_b_stats SET STATISTICS -1; +\d+ ab1 +Table "public.ab1" + Column | Type | Collation | Nullable | Default | Storage | Stats target | Description ++-+---+--+-+-+--+- + a | integer | | | | plain | | + b | integer | | | | plain | | +Statistics objects: +"public"."ab1_a_b_stats" (ndistinct, dependencies, mcv) ON a, b FROM ab1 + -- partial analyze doesn't build stats either ANALYZE ab1 (a); WARNING: statistics object "public.ab1_a_b_stats" could not be computed for relation "public.ab1" diff --git a/src/test/regress/sql/stats_ext.sql b/src/test/regress/sql/stats_ext.sql index 2834a902a7..d1d49948b4 100644 --- a/src/test/regress/sql/stats_ext.sql +++ b/src/test/regress/sql/stats_ext.sql @@ -72,12 +72,14 @@ ANALYZE ab1; ALTER TABLE ab1 ALTER a SET STATISTICS -1; -- setting statistics target 0 skips the statistics, 
without printing any message, so check catalog ALTER STATISTICS ab1_a_b_stats SET STATISTICS 0; +\d+ ab1 ANALYZE ab1; SELECT stxname, stxdndistinct, stxddependencies, stxdmcv FROM pg_statistic_ext s, pg_statistic_ext_data d WHERE s.stxname = 'ab1_a_b_stats' AND d.stxoid = s.oid; ALTER STATISTICS ab1_a_b_stats SET STATISTICS -1; +\d+ ab1 -- partial analyze doesn't build stats either ANALYZE ab1 (a); ANALYZE ab1; -- 2.17.0
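For anyone trying the patch out, a minimal session exercising the new footer could look like the sketch below. The table and statistics object names are invented for illustration; the trailing "STATISTICS n" format is taken from the regression output in the patch.

  CREATE TABLE orders (region int, product int);
  CREATE STATISTICS orders_stats (ndistinct, dependencies) ON region, product FROM orders;
  -- Per-object target, settable since d06215d03:
  ALTER STATISTICS orders_stats SET STATISTICS 500;
  -- With the patch, \d+ orders should now list the object with a trailing "STATISTICS 500";
  -- resetting the target to -1 makes the suffix disappear again:
  ALTER STATISTICS orders_stats SET STATISTICS -1;
  \d+ orders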
Re: Compatible defaults for LEAD/LAG
On Sun, Aug 30, 2020 at 23:59, Tom Lane wrote: > Pavel Stehule writes: > > This is nice example of usage of anycompatible type (that is consistent > > with other things in Postgres), but standard says something else. > > It can be easily solved with https://commitfest.postgresql.org/28/2081/ > - > > but Tom doesn't like this patch. > > I am more inclined to think so this feature should be implemented > > differently - there is no strong reason to go against standard or against > > the implementations of other databases (and increase the costs of > porting). > > Now the implementation is limited, but allowed behaviour is 100% ANSI. > > I don't particularly buy this argument. The case at hand is what to do > if we have, say, > > select lag(integer_column, 1, 1.2) over ... > > The proposed patch would result in the output being of type numeric, > and any rows using the default would show "1.2". The spec says that > the right thing is to return integer, and we should round the default > to "1" to make that work. But > > (1) I doubt that anybody actually writes such things; > > (2) For anyone who does write it, the spec's behavior fails to meet > the principle of least surprise. It is not normally the case that > any information-losing cast would be applied silently within an > expression. > postgres=# create table foo(a int); CREATE TABLE postgres=# insert into foo values(1.1); INSERT 0 1 postgres=# create table foo(a int default 1.1); CREATE TABLE postgres=# insert into foo values(default); INSERT 0 1 postgres=# select * from foo; ┌───┐ │ a │ ╞═══╡ │ 1 │ └───┘ (1 row) It is maybe strange, but it is not an unusual pattern in SQL. I think it is analogous to a DEFAULT clause: DECLARE a int DEFAULT 1.2; The default value doesn't change the type of the variable. This is maybe too simplistic an example; a slightly more realistic one is CREATE OR REPLACE FUNCTION foo(a numeric, b numeric, ... DECLARE x int DEFAULT a; BEGIN ... I am worried about performance - if a default value can change the type, then other things can stop working, like index usage. In *this* case we are not talking about an operation between two independent operands or function arguments. We are talking about a value and a *default* for that value. > So this deviation from spec doesn't bother me; we have much bigger ones. > OK, if it is acceptable to other people, I can accept it too - I understand that it is a corner case and that there is some consistency with other Postgres features. Maybe this difference should be mentioned in the documentation. > My concern with this patch is what I said upthread: I'm not sure that > we should be adjusting this behavior in a piecemeal fashion.
> I looked
> through pg_proc to find all the functions that have more than one any*
> argument, and found these:
>
>                       oid                       | prorettype
> ------------------------------------------------+------------
>  lag(anyelement,integer,anyelement)             | anyelement
>  lead(anyelement,integer,anyelement)            | anyelement
>  width_bucket(anyelement,anyarray)              | integer
>  btarraycmp(anyarray,anyarray)                  | integer
>  array_eq(anyarray,anyarray)                    | boolean
>  array_ne(anyarray,anyarray)                    | boolean
>  array_lt(anyarray,anyarray)                    | boolean
>  array_gt(anyarray,anyarray)                    | boolean
>  array_le(anyarray,anyarray)                    | boolean
>  array_ge(anyarray,anyarray)                    | boolean
>  array_append(anyarray,anyelement)              | anyarray
>  array_prepend(anyelement,anyarray)             | anyarray
>  array_cat(anyarray,anyarray)                   | anyarray
>  array_larger(anyarray,anyarray)                | anyarray
>  array_smaller(anyarray,anyarray)               | anyarray
>  array_position(anyarray,anyelement)            | integer
>  array_position(anyarray,anyelement,integer)    | integer
>  array_positions(anyarray,anyelement)           | integer[]
>  array_remove(anyarray,anyelement)              | anyarray
>  array_replace(anyarray,anyelement,anyelement)  | anyarray
>  arrayoverlap(anyarray,anyarray)                | boolean
>  arraycontains(anyarray,anyarray)               | boolean
>  arraycontained(anyarray,anyarray)              | boolean
>  elem_contained_by_range(anyelement,anyrange)   | boolean
>  range_contains_elem(anyrange,anyelement)       | boolean
>  range_eq(anyrange,anyrange)                    | boolean
>  range_ne(anyrange,anyrange)                    | boolean
>  range_overlaps(anyrange,anyrange)              | boolean
>  range_contains(anyrange,anyrange)              | boolean
>  range_contained_by(anyrange,anyrange)          | boolean
>  range_adjacent(anyrange,anyrange)              | boolean
>  range_before(anyrange,anyrange)                | boolean
>  range_after(anyrange,anyrange)                 | boolean
>  range_overleft(anyrange,anyrange)              | boolean
>  range_overr
Re: Re: [HACKERS] Custom compression methods
On Thu, 13 Aug 2020 at 17:18, Dilip Kumar wrote: > I have rebased the patch on the latest head and currently, broken into 3 > parts. > > v1-0001: As suggested by Robert, it provides the syntax support for > setting the compression method for a column while creating a table and > adding columns. However, we don't support changing the compression > method for the existing column. As part of this patch, there is only > one built-in compression method that can be set (pglz). In this, we > have one in-build am (pglz) and the compressed attributes will directly > store the oid of the AM. In this patch, I have removed the > pg_attr_compresion as we don't support changing the compression > for the existing column so we don't need to preserve the old > compressions. > v1-0002: Add another built-in compression method (zlib) > v1:0003: Remaining patch set (nothing is changed except rebase on the > current head, stabilizing check-world and 0001 and 0002 are pulled > out of this) > > Next, I will be working on separating out the remaining patches as per > the suggestion by Robert. Thanks for this new feature. Looks promising and very useful, with so many good compression libraries already available. I see that with the patch-set, I would be able to create an extension that defines a PostgreSQL C handler function which assigns all the required hook function implementations for compressing, decompressing and validating, etc. In short, I would be able to use a completely different compression algorithm to compress toast data if I write such an extension. Correct me if I am wrong with my interpretation. Just a quick superficial set of review comments A minor re-base is required due to a conflict in a regression test - In heap_toast_insert_or_update() and in other places, the comments for new parameter preserved_am_info are missing. - +toast_compress_datum(Datum value, Oid acoid) { struct varlena *tmp = NULL; int32 valsize; - CompressionAmOptions cmoptions; + CompressionAmOptions *cmoptions = NULL; I think tmp and cmoptions need not be initialized to NULL - - TOAST_COMPRESS_SET_RAWSIZE(tmp, valsize); - SET_VARSIZE_COMPRESSED(tmp, len + TOAST_COMPRESS_HDRSZ); /* successful compression */ + toast_set_compressed_datum_info(tmp, amoid, valsize); return PointerGetDatum(tmp); Any particular reason why is this code put in a new extern function ? Is there a plan to re-use it ? Otherwise, it's not necessary to do this. Also, not sure why "HTAB *amoptions_cache" and "MemoryContext amoptions_cache_mcxt" aren't static declarations. 
They are being used only in toast_internals.c --- The tab-completion doesn't show COMPRESSION : postgres=# create access method my_method TYPE INDEX TABLE postgres=# create access method my_method TYPE Also, the below syntax also would better be tab-completed so as to display all the installed compression methods, in line with how we show all the storage methods like plain,extended,etc: postgres=# ALTER TABLE lztab ALTER COLUMN t SET COMPRESSION I could see the differences in compression ratio, and the compression and decompression speed when I use lz versus zib : CREATE TABLE zlibtab(t TEXT COMPRESSION zlib WITH (level '4')); create table lztab(t text); ALTER TABLE lztab ALTER COLUMN t SET COMPRESSION pglz; pgg:s2:pg$ time psql -c "\copy zlibtab from text.data" COPY 13050 real0m1.344s user0m0.031s sys 0m0.026s pgg:s2:pg$ time psql -c "\copy lztab from text.data" COPY 13050 real0m2.088s user0m0.008s sys 0m0.050s pgg:s2:pg$ time psql -c "select pg_table_size('zlibtab'::regclass), pg_table_size('lztab'::regclass)" pg_table_size | pg_table_size ---+--- 1261568 | 1687552 pgg:s2:pg$ time psql -c "select NULL from zlibtab where t like ''" > /dev/null real0m0.127s user0m0.000s sys 0m0.002s pgg:s2:pg$ time psql -c "select NULL from lztab where t like ''" > /dev/null real0m0.050s user0m0.002s sys 0m0.000s -- Thanks, -Amit Khandekar Huawei Technologies
Re: list of extended statistics on psql
On Thu, Aug 27, 2020 at 07:53:23PM -0400, Alvaro Herrera wrote: > +1 for the general idea, and +1 for \dX being the syntax to use > > IMO the per-type columns should show both the type being enabled as > well as it being built. > > (How many more stat types do we expect -- Tomas? I wonder if having one > column per type is going to scale in the long run.) > > Also, the stat obj name column should be first, followed by a single > column listing both table and columns that it applies to. Keep in mind > that in the future we might want to add stats that cross multiple tables > -- that's why the CREATE syntax is the way it is. So we should give > room for that in psql's display too. There's also a plan for CREATE STATISTICS to support expression statistics, with the statistics functionality of an expression index, but without the cost of index-update on UPDATE/DELETE. That's Tomas' patch here: https://commitfest.postgresql.org/29/2421/ I think that would compute ndistinct and MCV, same as indexes, but not dependencies. To me, I think it's better if there's a single column showing the "kinds" of statistics to be generated (stxkind), rather than a column for each. I'm not sure why the length of the stats lists cast as text is useful to show? We don't have a slash-dee command to show the number of MCV or histogram in traditional, 1-D stats in pg_statistic, right? I think anybody wanting that would learn to SELECT FROM pg_statistic*. Also, the length of the text output isn't very meaningful? If this is json, maybe you'd do something like this: |SELECT a.stxdndistinct , COUNT(b) FROM pg_statistic_ext_data a , json_each(stxdndistinct::Json) AS b GROUP BY 1 I guess stxdmcv isn't json, but it seems especially meaningless to show length() of its ::text, since we don't even "deserialize" the object to begin with. BTW, I've just started a new thread about displaying in psql \d the stats target of extended stats. -- Justin
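Until a \dX exists, a rough equivalent of the single-kinds-column layout suggested here can already be had with a plain catalog query; the column labels below are mine and not part of any patch:

  SELECT stxnamespace::regnamespace AS "Schema",
         stxname                    AS "Name",
         stxrelid::regclass         AS "Table",
         stxkind                    AS "Kinds"
  FROM pg_catalog.pg_statistic_ext
  ORDER BY 1, 2;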
Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions
On Sun, Aug 30, 2020 at 2:43 PM Dilip Kumar wrote: > > On Sat, Aug 29, 2020 at 5:18 PM Amit Kapila wrote: > > > > > One more comment for which I haven't done anything yet. > > +static void > > +set_schema_sent_in_streamed_txn(RelationSyncEntry *entry, TransactionId > > xid) > > +{ > > + MemoryContext oldctx; > > + > > + oldctx = MemoryContextSwitchTo(CacheMemoryContext); > > + > > + entry->streamed_txns = lappend_int(entry->streamed_txns, xid); > > > Is it a good idea to append xid with lappend_int? Won't we need > > something equivalent for uint32? If so, I think we have a couple of > > options (a) use lcons method and accordingly append the pointer to > > xid, I think we need to allocate memory for xid if we want to use this > > idea or (b) use an array instead. What do you think? > > BTW, OID is internally mapped to uint32, but using lappend_oid might > not look good. So maybe we can provide an option for lappend_uint32? > Using an array is also not a bad idea. Providing lappend_uint32 > option looks more appealing to me. > I thought about this again and I feel it might be okay to use it for our case as after storing it in T_IntList, we primarily fetch it for comparison with TrasnactionId (uint32), so this shouldn't create any problem. I feel we can just discuss this in a separate thread and check the opinion of others, what do you think? Another comment: +cleanup_rel_sync_cache(TransactionId xid, bool is_commit) +{ + HASH_SEQ_STATUS hash_seq; + RelationSyncEntry *entry; + + Assert(RelationSyncCache != NULL); + + hash_seq_init(&hash_seq, RelationSyncCache); + while ((entry = hash_seq_search(&hash_seq)) != NULL) + { + if (is_commit) + entry->schema_sent = true; How is it correct to set 'entry->schema_sent' for all the entries in RelationSyncCache? Consider a case where due to invalidation in an unrelated transaction we have set the flag schema_sent for a particular relation 'r1' as 'false' and that transaction is executed before the current streamed transaction for which we are performing commit and called this function. It will set the flag for unrelated entry in this case 'r1' which doesn't seem correct to me. Or, if this is correct, it would be a good idea to write some comments about it. -- With Regards, Amit Kapila.
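(For context: the functions discussed above live in the pgoutput plugin, and the streamed-transaction path they handle is only reached for subscriptions that enable the patch set's streaming option. A sketch of the SQL-level setup, with the option name taken from the patch set and the connection string purely illustrative:)

  -- On the publisher:
  CREATE PUBLICATION pub_orders FOR TABLE orders;
  -- On the subscriber:
  CREATE SUBSCRIPTION sub_orders
    CONNECTION 'host=publisher dbname=postgres'
    PUBLICATION pub_orders
    WITH (streaming = on);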
Re: Implementing Incremental View Maintenance
Hi, I updated the wiki page. https://wiki.postgresql.org/wiki/Incremental_View_Maintenance On Fri, 21 Aug 2020 21:40:50 +0900 (JST) Tatsuo Ishii wrote: > From: Yugo NAGATA > Subject: Re: Implementing Incremental View Maintenance > Date: Fri, 21 Aug 2020 17:23:20 +0900 > Message-ID: <20200821172320.a2506577d5244b6066f69...@sraoss.co.jp> > > > On Wed, 19 Aug 2020 10:02:42 +0900 (JST) > > Tatsuo Ishii wrote: > > > >> I have looked into this. > > > > Thank you for your reviewing! > > > >> - 0004-Allow-to-prolong-life-span-of-transition-tables-unti.patch: > >> This one needs a comment to describe what the function does etc. > >> > >> +void > >> +SetTransitionTablePreserved(Oid relid, CmdType cmdType) > >> +{ > > > > I added a comment for this function and related places. > > > > +/* > > + * SetTransitionTablePreserved > > + * > > + * Prolong lifespan of transition tables corresponding specified relid and > > + * command type to the end of the outmost query instead of each nested > > query. > > + * This enables to use nested AFTER trigger's transition tables from outer > > + * query's triggers. Currently, only immediate incremental view > > maintenance > > + * uses this. > > + */ > > +void > > +SetTransitionTablePreserved(Oid relid, CmdType cmdType) > > > > Also, I removed releted unnecessary code which was left accidentally. > > > > > >> - 0007-Add-aggregates-support-in-IVM.patch > >> "Check if the given aggregate function is supporting" shouldn't be > >> "Check if the given aggregate function is supporting IVM"? > > > > Yes, you are right. I fixed this, too. > > > >> > >> + * check_aggregate_supports_ivm > >> + * > >> + * Check if the given aggregate function is supporting > > Thanks for the fixes. I have changed the commit fest status to "Ready > for Committer". > > Best regards, > -- > Tatsuo Ishii > SRA OSS, Inc. Japan > English: http://www.sraoss.co.jp/index_en.php > Japanese:http://www.sraoss.co.jp > > -- Yugo NAGATA
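For readers who have not followed the thread, a minimal example of the feature, assuming the CREATE INCREMENTAL MATERIALIZED VIEW syntax described on the wiki page above (table, view, and column names invented):

  CREATE TABLE sales (product_id int, amount numeric);
  CREATE INCREMENTAL MATERIALIZED VIEW sales_summary AS
    SELECT product_id, count(*) AS cnt, sum(amount) AS total
    FROM sales
    GROUP BY product_id;
  -- Subsequent INSERT/UPDATE/DELETE on sales is reflected in sales_summary
  -- immediately via AFTER triggers and their transition tables, without a
  -- manual REFRESH MATERIALIZED VIEW.
  INSERT INTO sales VALUES (1, 100), (1, 50), (2, 30);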
Re: Terminate the idle sessions
> On Aug 31, 2020, at 11:43 AM, Thomas Munro wrote: > > On Mon, Aug 31, 2020 at 2:40 PM Li Japin wrote: >> Could you give the more details about the test instructions? > > Hi Japin, > > Sure. Because I wasn't trying to get reliable TPS number or anything, > I just used a simple short read-only test with one connection, like > this: > > pgbench -i -s10 postgres > pgbench -T60 -Mprepared -S postgres > > Then I looked for the active backend and ran strace -c -p XXX for a > few seconds and hit ^C to get the counters. I doubt the times are > very accurate, but the number of calls is informative. > > If you do that on a server running with -c statement_timeout=10s, you > see one setitimer() per transaction. If you also use -c > idle_session_timeout=10s at the same time, you see two. Hi, Thomas, Thanks for your point out this problem, here is the comparison. Without Optimize settimer usage: % time seconds usecs/call callserrors syscall -- --- --- - - 41.221.444851 1 1317033 setitimer 28.410.995936 2658622 sendto 24.630.863316 1659116 599 recvfrom 5.710.200275 2111055 pread64 0.030.001152 2 599 epoll_wait 0.000.00 0 1 epoll_ctl -- --- --- - - 100.003.505530 2746426 599 total With Optimize settimer usage: % time seconds usecs/call callserrors syscall -- --- --- - - 49.891.464332 1 1091429 sendto 40.831.198389 1 1091539 219 recvfrom 9.260.271890 1183321 pread64 0.020.000482 2 214 epoll_wait 0.000.13 3 5 setitimer 0.000.10 2 5 rt_sigreturn 0.000.00 0 1 epoll_ctl -- --- --- - - 100.002.935116 2366514 219 total Here’s a modified version of Thomas’s patch. v3-0001-Allow-terminating-the-idle-sessions.patch Description: v3-0001-Allow-terminating-the-idle-sessions.patch v3-0002-Optimize-setitimer-usage.patch Description: v3-0002-Optimize-setitimer-usage.patch
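(For reference, the user-facing knob being benchmarked here is just a GUC; the name below is the one used in this thread, and the sketch assumes it is set like the existing timeout GUCs such as statement_timeout and idle_in_transaction_session_timeout:)

  -- Cluster-wide, via ALTER SYSTEM:
  ALTER SYSTEM SET idle_session_timeout = '10min';
  SELECT pg_reload_conf();
  -- Or per session:
  SET idle_session_timeout = '30s';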
Re: Hybrid Hash/Nested Loop joins and caching results from subplans
On Sat, Aug 29, 2020 at 3:33 AM Robert Haas wrote: > I think David's points elsewhere on the thread about ProjectSet and > Materialize nodes are interesting. Indeed, I'm now finding it very difficult to look past the similarity with: postgres=# explain select count(*) from t t1 cross join t t2; QUERY PLAN Aggregate (cost=1975482.56..1975482.57 rows=1 width=8) -> Nested Loop (cost=0.00..1646293.50 rows=131675625 width=0) -> Seq Scan on t t1 (cost=0.00..159.75 rows=11475 width=0) -> Materialize (cost=0.00..217.12 rows=11475 width=0) -> Seq Scan on t t2 (cost=0.00..159.75 rows=11475 width=0) (5 rows) I wonder what it would take to overcome the overheads of the separate Result Cache node, with techniques to step out of the way or something like that. > [tricky philosophical questions about ancient and maybe in some cases > arbitrary choices] Ack.
Re: More tests with USING INDEX replident and dropped indexes
On Mon, Aug 31, 2020 at 10:36:13AM +0530, Rahila wrote: > Now, the relreplident is being set in the transaction previous to > the one marking index as invalid for concurrent drop. Won't > it cause issues with relreplident cleared but index not dropped, > if drop index concurrently fails in the small window after > commit transaction in above snippet and before the start transaction above? Argh. I missed your point that this stuff uses heap_inplace_update(), so the update of indisvalid flag is not transactional. The thing is that we won't be able to update the flag consistently with relreplident except if we switch index_set_state_flags() to use a transactional operation here. So, with the current state of affairs, it looks like we could just call SetRelationReplIdent() in the last transaction, after the transaction marking the index as dead has committed (see the top of index_set_state_flags() saying that this should be the last step of a transaction), but that leaves a window open as you say. On the other hand, it looks appealing to make index_set_state_flags() transactional. This would solve the consistency problem, and looking at the code scanning pg_index, I don't see a reason why we could not do that. What do you think? > To the existing partitioned table test, can we add a test to demonstrate > that relreplident is set to 'n' for even the individual partitions. Makes sense. It is still important to test the case where a partitioned table without leaves is correctly reset IMO. > Typo, s/updates are visible/updates visible Thanks. > - For DROP INDEX CONCURRENTLY, pg_class.relreplident of the parent > table is set in the last transaction doing the drop. It would be > tempting to reset the flag in the same transaction as the one marking > the index as invalid, but there could be a point in reindexing the > invalid index whose drop has failed, and this adds morecomplexity to > the patch. That's a point I studied, but it makes actually more sense to just reset the flag once the index is marked as invalid, similarly to indisclustered. Reindexing an invalid index can have value in some cases, but we would have much more problems with the relcache if we finish with two indexes marked as indreplident :( -- Michael signature.asc Description: PGP signature
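To make the user-visible effect concrete, here is a sketch of the scenario the patch changes (relation and index names invented):

  CREATE TABLE t1 (a int NOT NULL);
  CREATE UNIQUE INDEX t1_a_idx ON t1 (a);
  ALTER TABLE t1 REPLICA IDENTITY USING INDEX t1_a_idx;
  DROP INDEX CONCURRENTLY t1_a_idx;
  -- Unpatched, relreplident stays 'i' even though no index is marked
  -- indisreplident any more; with the patch it is reset to 'n':
  SELECT relreplident FROM pg_class WHERE relname = 't1';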
Re: please update ps display for recovery checkpoint
On Thu, Aug 20, 2020 at 05:09:05PM +0900, Michael Paquier wrote: > That could be helpful. Wouldn't it be better to use "end-of-recovery > checkpoint" instead? That's the common wording in the code comments. > > I don't see the point of patch 0002. In the same paragraph, we > already know that this applies to any checkpoints. Thinking more about this.. Could it be better to just add some calls to set_ps_display() directly in CreateCheckPoint()? This way, both the checkpointer as well as the startup process at the end of recovery would benefit from the change. -- Michael signature.asc Description: PGP signature