Re: index paths and enable_indexscan
On Tue, Apr 14, 2020 at 2:44 PM Amit Langote wrote: > Hi, > > Maybe I am missing something obvious, but is it intentional that > enable_indexscan is checked by cost_index(), that is, *after* creating > an index path? I was expecting that if enable_indexscan is off, then > no index paths would be generated to begin with, because I thought > they are optional. > I think the cost estimate of index paths is the same as for other paths, in that setting enable_xxx to off only adds a penalty factor (disable_cost) to the path's cost. The path would still be generated and compete with other paths in add_path(). Thanks, Richard
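[Editor's note: a minimal, self-contained C model of the mechanism Richard describes, not PostgreSQL source. The path names and cost numbers are invented, and DISABLE_COST only mirrors the idea of disable_cost in costsize.c. The point it illustrates: a disabled path is still built and still costed, it merely carries a huge penalty, so it loses in add_path() whenever a non-penalized alternative exists.]

    #include <stdio.h>
    #include <stdbool.h>

    #define DISABLE_COST 1.0e10     /* same idea as disable_cost in costsize.c */

    typedef struct Path
    {
        const char *name;
        double      cost;
    } Path;

    /*
     * Mimics cost_index()/cost_seqscan(): the path is always built; if its
     * scan type is disabled it merely gets a large penalty added to its cost.
     */
    static Path
    make_path(const char *name, double base_cost, bool enabled)
    {
        Path        p = {name, base_cost};

        if (!enabled)
            p.cost += DISABLE_COST;
        return p;
    }

    /* Mimics add_path(): the cheaper path wins. */
    static Path
    cheaper(Path a, Path b)
    {
        return (a.cost <= b.cost) ? a : b;
    }

    int
    main(void)
    {
        /* Invented base costs; enable_seqscan = on, enable_indexscan = off. */
        Path        seq = make_path("seqscan", 450.0, true);
        Path        idx = make_path("indexscan", 8.5, false);
        Path        best = cheaper(seq, idx);

        /* The penalized index path loses to the unpenalized seqscan. */
        printf("chosen: %s (cost %.1f)\n", best.name, best.cost);
        return 0;
    }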
Re: pgsql: Improve handling of parameter differences in physical replication
On 2020-03-30 20:10, Andres Freund wrote: Also, shouldn't dynahash be adjusted as well? There's e.g. the following HASH_ENTER path:

    /* report a generic message */
    if (hashp->isshared)
        ereport(ERROR,
                (errcode(ERRCODE_OUT_OF_MEMORY),
                 errmsg("out of shared memory")));

Could you explain further what you mean by this? I don't understand how this is related. -- Peter Eisentraut http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Re: PG compilation error with Visual Studio 2015/2017/2019
On Fri, Apr 10, 2020 at 5:33 PM Amit Kapila wrote: > > It seems the right direction to use GetLocaleInfoEx as we have already > used it to handle a similar problem (lc_codepage is missing in > _locale_t in higher versions of MSVC (cf commit 0fb54de9aa)) in > chklocale.c. However, I see that we have added a manual parsing there > if GetLocaleInfoEx doesn't parse it. I think that addresses your > concern for _create_locale handling bigger sets. Don't we need > something equivalent here for the cases which GetLocaleInfoEx doesn't > support? I am investigating along similar lines; I think the following explanation can help. From the Microsoft documentation, the locale argument to setlocale, _wsetlocale, _create_locale, and _wcreate_locale is

    locale :: "locale-name"
            | "language[_country-region[.code-page]]"
            | ".code-page"
            | "C"
            | ""
            | NULL

whereas for GetLocaleInfoEx the locale argument is a locale name of the form "<language>-<country/region>".
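[Editor's note: a minimal Windows-only C sketch of the difference being discussed; the locale name "en-US" is just an example value. GetLocaleInfoEx() accepts the "<language>-<region>" style of name, not the setlocale()/_create_locale() "language_country.codepage" form, and can return the ANSI code page, which is the kind of information at issue in this thread.]

    #include <windows.h>
    #include <stdio.h>
    #include <wchar.h>

    int
    main(void)
    {
        wchar_t     codepage[16];

        /*
         * GetLocaleInfoEx() wants a locale *name* such as "en-US", not the
         * "English_United States.1252" form accepted by setlocale() and
         * _create_locale().
         */
        if (GetLocaleInfoEx(L"en-US", LOCALE_IDEFAULTANSICODEPAGE,
                            codepage, sizeof(codepage) / sizeof(codepage[0])) > 0)
            wprintf(L"ANSI code page for en-US: %ls\n", codepage);
        else
            fprintf(stderr, "GetLocaleInfoEx() failed: %lu\n",
                    (unsigned long) GetLastError());
        return 0;
    }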
Re: index paths and enable_indexscan
On Tue, Apr 14, 2020 at 4:13 PM Richard Guo wrote: > On Tue, Apr 14, 2020 at 2:44 PM Amit Langote wrote: >> Maybe I am missing something obvious, but is it intentional that >> enable_indexscan is checked by cost_index(), that is, *after* creating >> an index path? I was expecting that if enable_indexscan is off, then >> no index paths would be generated to begin with, because I thought >> they are optional. > > > I think the cost estimate of index paths is the same as other paths on > that setting enable_xxx to off only adds a penalty factor (disable_cost) > to the path's cost. The path would be still generated and compete with > other paths in add_path(). Yeah, but I am asking why build the path to begin with, as there will always be seq scan path for base rels. Turning enable_hashjoin off, for example, means that no hash join paths will be built at all. Looking into the archives, I see that the idea of "not generating disabled paths to begin with" was discussed quite recently: https://www.postgresql.org/message-id/29821.1572706653%40sss.pgh.pa.us -- Amit Langote EnterpriseDB: http://www.enterprisedb.com
Display of buffers for planning time show nothing for second run
Hi,

I am testing some features from Postgres 13, and I am not sure I understand the behavior of EXPLAIN (ANALYZE, BUFFERS) correctly.

When I run the following statement for the first time in a session I get

postgres=# EXPLAIN (BUFFERS, ANALYZE) SELECT * FROM obce WHERE okres_id = 'CZ0201';
                                                         QUERY PLAN
─────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
 Index Scan using obce_okres_id_idx on obce  (cost=0.28..14.49 rows=114 width=41) (actual time=0.072..0.168 rows=114 loops=1)
   Index Cond: ((okres_id)::text = 'CZ0201'::text)
   Buffers: shared hit=4
 Planning Time: 0.539 ms
   Buffers: shared hit=13
 Execution Time: 0.287 ms
(6 rows)

and I see shared hit=13 under planning time.

For the second run I get

postgres=# EXPLAIN (BUFFERS, ANALYZE) SELECT * FROM obce WHERE okres_id = 'CZ0201';
                                                         QUERY PLAN
─────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
 Index Scan using obce_okres_id_idx on obce  (cost=0.28..14.49 rows=114 width=41) (actual time=0.044..0.101 rows=114 loops=1)
   Index Cond: ((okres_id)::text = 'CZ0201'::text)
   Buffers: shared hit=4
 Planning Time: 0.159 ms
 Execution Time: 0.155 ms
(5 rows)

Now there are no buffer hits under planning time at all. Does this mean that all these data are cached somewhere in session memory?

Regards
Pavel
Re: Display of buffers for planning time show nothing for second run
Hi, On Tue, Apr 14, 2020 at 10:18 AM Pavel Stehule wrote: > > Hi > > I am testing some features from Postgres 13, and I am not sure if I > understand well to behave of EXPLAIN(ANALYZE, BUFFERS) > > When I run following statement first time in session I get > > postgres=# EXPLAIN (BUFFERS, ANALYZE) SELECT * FROM obce WHERE okres_id = > 'CZ0201'; > ┌──┐ > │ QUERY PLAN > │ > ╞══╡ > │ Index Scan using obce_okres_id_idx on obce (cost=0.28..14.49 rows=114 > width=41) (actual time=0.072..0.168 rows=114 loops=1) │ > │ Index Cond: ((okres_id)::text = 'CZ0201'::text) > │ > │ Buffers: shared hit=4 > │ > │ Planning Time: 0.539 ms > │ > │ Buffers: shared hit=13 > │ > │ Execution Time: 0.287 ms > │ > └──┘ > (6 rows) > > And I see share hit 13 in planning time. > > For second run I get > > postgres=# EXPLAIN (BUFFERS, ANALYZE) SELECT * FROM obce WHERE okres_id = > 'CZ0201'; > ┌──┐ > │ QUERY PLAN > │ > ╞══╡ > │ Index Scan using obce_okres_id_idx on obce (cost=0.28..14.49 rows=114 > width=41) (actual time=0.044..0.101 rows=114 loops=1) │ > │ Index Cond: ((okres_id)::text = 'CZ0201'::text) > │ > │ Buffers: shared hit=4 > │ > │ Planning Time: 0.159 ms > │ > │ Execution Time: 0.155 ms > │ > └──┘ > (5 rows) > > Now, there is not any touch in planning time. Does it mean so this all these > data are cached somewhere in session memory? The planning time is definitely shorter the 2nd time. And yes, what you see are all the catcache accesses that are initially performed on a fresh new backend.
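[Editor's note: a toy, self-contained C model of what Julien describes, not PostgreSQL code; names and sizes are invented. The backend-local catalog cache is populated from shared buffers on the first lookup, so only the first planning run in a session reports those buffer hits.]

    #include <stdio.h>
    #include <string.h>

    #define CACHE_SIZE 16

    static const char *cache_keys[CACHE_SIZE];
    static int  cache_vals[CACHE_SIZE];
    static int  ncached = 0;
    static int  shared_buffer_hits = 0;   /* what EXPLAIN (BUFFERS) would report */

    static int
    fetch_from_shared_buffers(const char *key)
    {
        shared_buffer_hits++;             /* simulated shared-memory access */
        return (int) strlen(key);         /* pretend this is the catalog row */
    }

    /* Toy "catcache": first lookup misses and reads shared buffers,
     * later lookups for the same key are served from local memory. */
    static int
    catcache_lookup(const char *key)
    {
        int         val;

        for (int i = 0; i < ncached; i++)
            if (strcmp(cache_keys[i], key) == 0)
                return cache_vals[i];     /* local hit: no buffer access */

        val = fetch_from_shared_buffers(key);
        if (ncached < CACHE_SIZE)         /* remember it for later lookups */
        {
            cache_keys[ncached] = key;
            cache_vals[ncached] = val;
            ncached++;
        }
        return val;
    }

    int
    main(void)
    {
        catcache_lookup("pg_class:obce"); /* first planning: buffer hit */
        printf("after 1st lookup: %d shared hits\n", shared_buffer_hits);
        catcache_lookup("pg_class:obce"); /* second planning: served locally */
        printf("after 2nd lookup: %d shared hits\n", shared_buffer_hits);
        return 0;
    }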
Re: index paths and enable_indexscan
On Tue, Apr 14, 2020 at 3:40 PM Amit Langote wrote: > On Tue, Apr 14, 2020 at 4:13 PM Richard Guo > wrote: > > On Tue, Apr 14, 2020 at 2:44 PM Amit Langote > wrote: > >> Maybe I am missing something obvious, but is it intentional that > >> enable_indexscan is checked by cost_index(), that is, *after* creating > >> an index path? I was expecting that if enable_indexscan is off, then > >> no index paths would be generated to begin with, because I thought > >> they are optional. > > > > > > I think the cost estimate of index paths is the same as other paths on > > that setting enable_xxx to off only adds a penalty factor (disable_cost) > > to the path's cost. The path would be still generated and compete with > > other paths in add_path(). > > Yeah, but I am asking why build the path to begin with, as there will > always be seq scan path for base rels. I guess that is because the user may disable seqscan as well. If so, we still need a formula to decide which one to use, which requires the index path to be costed. But since disabling the two at the same time is rare, we could skip building the index path whenever the user allows seqscan. > Turning enable_hashjoin off, > for example, means that no hash join paths will be built at all. > > As for joins, the difference is that even if the user allows a join method in the settings, the planner may still not be able to use it, so the disabled path still needs to be generated. Consider the query "select * from t1, t2 where f(t1.a, t2.a) = 3" with enable_nestloop = off and enable_hashjoin = on. But I think it is still possible to skip generating the path after some extra checking. > Looking into the archives, I see that the idea of "not generating > disabled paths to begin with" was discussed quite recently: > https://www.postgresql.org/message-id/29821.1572706653%40sss.pgh.pa.us > > -- > > Amit Langote > EnterpriseDB: http://www.enterprisedb.com > > >
Re: Display of buffers for planning time show nothing for second run
út 14. 4. 2020 v 10:27 odesílatel Julien Rouhaud napsal: > Hi, > > On Tue, Apr 14, 2020 at 10:18 AM Pavel Stehule > wrote: > > > > Hi > > > > I am testing some features from Postgres 13, and I am not sure if I > understand well to behave of EXPLAIN(ANALYZE, BUFFERS) > > > > When I run following statement first time in session I get > > > > postgres=# EXPLAIN (BUFFERS, ANALYZE) SELECT * FROM obce WHERE okres_id > = 'CZ0201'; > > > ┌──┐ > > │ QUERY PLAN > │ > > > ╞══╡ > > │ Index Scan using obce_okres_id_idx on obce (cost=0.28..14.49 rows=114 > width=41) (actual time=0.072..0.168 rows=114 loops=1) │ > > │ Index Cond: ((okres_id)::text = 'CZ0201'::text) > │ > > │ Buffers: shared hit=4 > │ > > │ Planning Time: 0.539 ms > │ > > │ Buffers: shared hit=13 >│ > > │ Execution Time: 0.287 ms >│ > > > └──┘ > > (6 rows) > > > > And I see share hit 13 in planning time. > > > > For second run I get > > > > postgres=# EXPLAIN (BUFFERS, ANALYZE) SELECT * FROM obce WHERE okres_id > = 'CZ0201'; > > > ┌──┐ > > │ QUERY PLAN > │ > > > ╞══╡ > > │ Index Scan using obce_okres_id_idx on obce (cost=0.28..14.49 rows=114 > width=41) (actual time=0.044..0.101 rows=114 loops=1) │ > > │ Index Cond: ((okres_id)::text = 'CZ0201'::text) > │ > > │ Buffers: shared hit=4 > │ > > │ Planning Time: 0.159 ms > │ > > │ Execution Time: 0.155 ms >│ > > > └──┘ > > (5 rows) > > > > Now, there is not any touch in planning time. Does it mean so this all > these data are cached somewhere in session memory? > > The planning time is definitely shorter the 2nd time. And yes, what > you see are all the catcache accesses that are initially performed on > a fresh new backend. > One time Tom Lane mentioned using index in planning time for getting minimum and maximum. I expected so these values are not cached. But I cannot to reproduce it, and then I am little bit surprised so I don't see any hit in second, and other executions. Regards Pavel
Re: Display of buffers for planning time show nothing for second run
On Tue, Apr 14, 2020 at 5:27 PM Julien Rouhaud wrote: > On Tue, Apr 14, 2020 at 10:18 AM Pavel Stehule > wrote: > > For second run I get > > > > postgres=# EXPLAIN (BUFFERS, ANALYZE) SELECT * FROM obce WHERE okres_id = > > 'CZ0201'; > > ┌──┐ > > │ QUERY PLAN > >│ > > ╞══╡ > > │ Index Scan using obce_okres_id_idx on obce (cost=0.28..14.49 rows=114 > > width=41) (actual time=0.044..0.101 rows=114 loops=1) │ > > │ Index Cond: ((okres_id)::text = 'CZ0201'::text) > >│ > > │ Buffers: shared hit=4 > >│ > > │ Planning Time: 0.159 ms > >│ > > │ Execution Time: 0.155 ms > >│ > > └──┘ > > (5 rows) > > > > Now, there is not any touch in planning time. Does it mean so this all > > these data are cached somewhere in session memory? > > The planning time is definitely shorter the 2nd time. And yes, what > you see are all the catcache accesses that are initially performed on > a fresh new backend. By the way, even with all catcaches served from local memory, one may still see shared buffers being hit during planning. For example: explain (buffers, analyze) select * from foo where a = 1; QUERY PLAN --- Index Only Scan using foo_pkey on foo (cost=0.15..8.17 rows=1 width=4) (actual time=0.010..0.011 rows=0 loops=1) Index Cond: (a = 1) Heap Fetches: 0 Buffers: shared hit=2 Planning Time: 0.775 ms Buffers: shared hit=72 Execution Time: 0.086 ms (7 rows) Time: 2.477 ms postgres=# explain (buffers, analyze) select * from foo where a = 1; QUERY PLAN --- Index Only Scan using foo_pkey on foo (cost=0.15..8.17 rows=1 width=4) (actual time=0.012..0.012 rows=0 loops=1) Index Cond: (a = 1) Heap Fetches: 0 Buffers: shared hit=2 Planning Time: 0.102 ms Buffers: shared hit=1 Execution Time: 0.047 ms (7 rows) It seems that 1 Buffer hit comes from get_relation_info() doing _bt_getrootheight() for that index on foo. -- Amit Langote EnterpriseDB: http://www.enterprisedb.com
Re: Display of buffers for planning time show nothing for second run
On Tue, Apr 14, 2020 at 10:36 AM Pavel Stehule wrote: > > út 14. 4. 2020 v 10:27 odesílatel Julien Rouhaud napsal: >> >> Hi, >> >> On Tue, Apr 14, 2020 at 10:18 AM Pavel Stehule >> wrote: >> > >> > Hi >> > >> > I am testing some features from Postgres 13, and I am not sure if I >> > understand well to behave of EXPLAIN(ANALYZE, BUFFERS) >> > >> > When I run following statement first time in session I get >> > >> > postgres=# EXPLAIN (BUFFERS, ANALYZE) SELECT * FROM obce WHERE okres_id = >> > 'CZ0201'; >> > ┌──┐ >> > │ QUERY PLAN >> > │ >> > ╞══╡ >> > │ Index Scan using obce_okres_id_idx on obce (cost=0.28..14.49 rows=114 >> > width=41) (actual time=0.072..0.168 rows=114 loops=1) │ >> > │ Index Cond: ((okres_id)::text = 'CZ0201'::text) >> > │ >> > │ Buffers: shared hit=4 >> > │ >> > │ Planning Time: 0.539 ms >> > │ >> > │ Buffers: shared hit=13 >> > │ >> > │ Execution Time: 0.287 ms >> > │ >> > └──┘ >> > (6 rows) >> > >> > And I see share hit 13 in planning time. >> > >> > For second run I get >> > >> > postgres=# EXPLAIN (BUFFERS, ANALYZE) SELECT * FROM obce WHERE okres_id = >> > 'CZ0201'; >> > ┌──┐ >> > │ QUERY PLAN >> > │ >> > ╞══╡ >> > │ Index Scan using obce_okres_id_idx on obce (cost=0.28..14.49 rows=114 >> > width=41) (actual time=0.044..0.101 rows=114 loops=1) │ >> > │ Index Cond: ((okres_id)::text = 'CZ0201'::text) >> > │ >> > │ Buffers: shared hit=4 >> > │ >> > │ Planning Time: 0.159 ms >> > │ >> > │ Execution Time: 0.155 ms >> > │ >> > └──┘ >> > (5 rows) >> > >> > Now, there is not any touch in planning time. Does it mean so this all >> > these data are cached somewhere in session memory? >> >> The planning time is definitely shorter the 2nd time. And yes, what >> you see are all the catcache accesses that are initially performed on >> a fresh new backend. > > > One time Tom Lane mentioned using index in planning time for getting minimum > and maximum. I expected so these values are not cached. But I cannot to > reproduce it, and then I am little bit surprised so I don't see any hit in > second, and other executions. Isn't that get_actual_variable_range() purpose? If you use a plan that hit this function you'll definitely see consistent buffer usage during planning: rjuju=# explain (buffers, analyze) select * from pg_class c join pg_attribute a on a.attrelid = c.oid; QUERY PLAN --- Hash Join (cost=21.68..110.91 rows=2863 width=504) (actual time=0.393..5.989 rows=2863 loops=1) Hash Cond: (a.attrelid = c.oid) Buffers: shared hit=40 read=29 -> Seq Scan on pg_attribute a (cost=0.00..81.63 rows=2863 width=239) (actual time=0.010..0.773 rows=2863 loops=1) Buffers: shared hit=28 read=25 -> Hash (cost=16.86..16.86 rows=386 width=265) (actual time=0.333..0.334 rows=386 loops=1) Buckets: 1024 Batches: 1 Memory Usage: 85kB Buffers: shared hit=9 read=4 -> Seq Scan on pg_class c (cost=0.00..16.86 rows=386 width=265) (actual time=0.004..0.123 rows=386 loop
Re: Display of buffers for planning time show nothing for second run
On Tue, Apr 14, 2020 at 10:40 AM Amit Langote wrote: > > On Tue, Apr 14, 2020 at 5:27 PM Julien Rouhaud wrote: > > On Tue, Apr 14, 2020 at 10:18 AM Pavel Stehule > > wrote: > > > For second run I get > > > > > > postgres=# EXPLAIN (BUFFERS, ANALYZE) SELECT * FROM obce WHERE okres_id = > > > 'CZ0201'; > > > ┌──┐ > > > │ QUERY PLAN > > > │ > > > ╞══╡ > > > │ Index Scan using obce_okres_id_idx on obce (cost=0.28..14.49 rows=114 > > > width=41) (actual time=0.044..0.101 rows=114 loops=1) │ > > > │ Index Cond: ((okres_id)::text = 'CZ0201'::text) > > > │ > > > │ Buffers: shared hit=4 > > > │ > > > │ Planning Time: 0.159 ms > > > │ > > > │ Execution Time: 0.155 ms > > > │ > > > └──┘ > > > (5 rows) > > > > > > Now, there is not any touch in planning time. Does it mean so this all > > > these data are cached somewhere in session memory? > > > > The planning time is definitely shorter the 2nd time. And yes, what > > you see are all the catcache accesses that are initially performed on > > a fresh new backend. > > By the way, even with all catcaches served from local memory, one may > still see shared buffers being hit during planning. For example: > > explain (buffers, analyze) select * from foo where a = 1; > QUERY PLAN > --- > Index Only Scan using foo_pkey on foo (cost=0.15..8.17 rows=1 > width=4) (actual time=0.010..0.011 rows=0 loops=1) >Index Cond: (a = 1) >Heap Fetches: 0 >Buffers: shared hit=2 > Planning Time: 0.775 ms >Buffers: shared hit=72 > Execution Time: 0.086 ms > (7 rows) > > Time: 2.477 ms > postgres=# explain (buffers, analyze) select * from foo where a = 1; > QUERY PLAN > --- > Index Only Scan using foo_pkey on foo (cost=0.15..8.17 rows=1 > width=4) (actual time=0.012..0.012 rows=0 loops=1) >Index Cond: (a = 1) >Heap Fetches: 0 >Buffers: shared hit=2 > Planning Time: 0.102 ms >Buffers: shared hit=1 > Execution Time: 0.047 ms > (7 rows) > > It seems that 1 Buffer hit comes from get_relation_info() doing > _bt_getrootheight() for that index on foo. Indeed. Having received some relcache invalidation should also lead to similar effect. Having those counters can help to quantify all of those interactions.
Re: index paths and enable_indexscan
On Tue, Apr 14, 2020 at 5:29 PM Andy Fan wrote: > On Tue, Apr 14, 2020 at 3:40 PM Amit Langote wrote: >> On Tue, Apr 14, 2020 at 4:13 PM Richard Guo wrote: >> > On Tue, Apr 14, 2020 at 2:44 PM Amit Langote >> > wrote: >> >> Maybe I am missing something obvious, but is it intentional that >> >> enable_indexscan is checked by cost_index(), that is, *after* creating >> >> an index path? I was expecting that if enable_indexscan is off, then >> >> no index paths would be generated to begin with, because I thought >> >> they are optional. >> > >> > I think the cost estimate of index paths is the same as other paths on >> > that setting enable_xxx to off only adds a penalty factor (disable_cost) >> > to the path's cost. The path would be still generated and compete with >> > other paths in add_path(). >> >> Yeah, but I am asking why build the path to begin with, as there will >> always be seq scan path for base rels. > > I guess that is because user may disable seqscan as well. If so, we > still need formula to decide with one to use, which requires index path > has to be calculated. but since disabling the two at the same time is rare, > we can ignore the index path build if user allow seqscan I am saying that instead of building the index path with disabled cost, just don't build it at all. A base rel will always have a sequential path, even if with disabled cost when enable_seqscan = off. -- Amit Langote EnterpriseDB: http://www.enterprisedb.com
Re: Lexer issues
Hello, On Mon, Apr 13, 2020 at 4:04 PM Patrick REED wrote: > > I am experimenting with postgres and am wondering if there is any tutorial on > how to properly add a new command to postgres. > > I want to add a new constraint on "CREATE ROLE" that requires an integer, it > has an identifier that is not a known (reserved or unreserved keyword) in > postgres, say we call it TestPatrick. In other words, I want to do this > "CREATE ROLE X TestPatrick=10". I am having an issue with having postgres > recognize my new syntax. > > I have seen this video: https://www.youtube.com/watch?v=uSEXTcEiXGQ and was > able to add have my postgres compile with my added word (modified gram.y, > kwlist.h, gram.cpp etc based on the video). However, when I use my syntax on > a client session, it still doesn't recognize my syntax... Are there any > specific lexer changes I need to make? I followed the example of CONNECTION > LIMIT and tried to mimic it for Create ROLE. I'd think that if you can get a successful compilation with a modified gram.y (and any kwlist change needed) the new syntax should be accepted (at least up to the parser, whether the utility command is properly handled is another thing), since there's a single version of the CreateRoleStmt. Is there any chance that you're somehow connecting to something else than the freshly make-install-ed binary, or that the error is coming from later stage than parsing?
Re: [PATCH] Keeps tracking the uniqueness with UniqueKey
On Fri, 3 Apr 2020 at 15:18, Andy Fan wrote: > All above 4 item Done. Just to explain my view on this going forward for PG14. I do plan to do a more thorough review of this soon. I wasn't so keen on pursuing this for PG13 as the skip scans patch [1] needs to use the same infrastructure this patch has added and it does not, yet. The infrastructure (knowing the unique properties of a RelOptInfo), as provided by the patch Andy has been working on, which is based on my rough prototype version, I believe should be used for the skip scans patch as well. I understand that as skip scans currently stands, Jasper has done quite a bit of work to add the UniqueKeys, however, this was unfortunately based on some early description of UniqueKeys where I had thought that we could just store EquivalenceClasses. I no longer think that's the case, and I believe the implementation that we require is something more along the lines of Andy's latest version of the patch. However, I've not quite stared at it long enough to be highly confident in that. I'd like to strike up a bit of a plan to move both Andy's work and the Skip scans work forward for PG14. Here are my thoughts so far: 1. Revise v4 of remove DISTINCT patch to split the patch into two pieces. 0001 should add the UniqueKey code but not any additional planner smarts to use them (i.e remove GROUP BY / DISTINCT) elimination parts. Join removals and Unique joins should use UniqueKeys in this patch. 0002 should add back the GROUP BY / DISTINCT smarts and add whatever tests should be added for that and include updating existing expected results and modifying any tests which no longer properly test what they're meant to be testing. I've done this with the attached patch. 2. David / Jesper to look at 0001 and build or align the existing skip scans 0001 patch to make use of Andy's 0001 patch. This will require tagging UniqueKeys onto Paths, not just RelOptInfos, plus a bunch of other work. Clearly UniqueKeys must suit both needs and since we have two different implementations each providing some subset of the features, then clearly we're not yet ready to move both skip scans and this patch forward together. We need to align that and move both patches forward together. Hopefully, the attached 0001 patch helps move that along. While I'm here, a quick review of Andy's v4 patch. I didn't address any of this in the attached v5. These are only based on what I saw when shuffling some code around. It's not an in-depth review. 1. Out of date comment in join.sql -- join removal is not possible when the GROUP BY contains a column that is -- not in the join condition. (Note: as of 9.6, we notice that b.id is a -- primary key and so drop b.c_id from the GROUP BY of the resulting plan; -- but this happens too late for join removal in the outer plan level.) explain (costs off) select d.* from d left join (select d, c_id from b group by b.d, b.c_id) s on d.a = s.d; You've changed the GROUP BY clause so it does not include b.id, so the Note in the comment is now misleading. 2. I think 0002 is overly restrictive in its demands that parse->hasAggs must be false. We should be able to just use a Group Aggregate with unsorted input when the input_rel is unique on the GROUP BY clause. This will save on hashing and sorting. Basically similar to what we do for when a query contains aggregates without any GROUP BY. 3. 
I don't quite understand why you changed this to a right join: -- Test case where t1 can be optimized but not t2 explain (costs off) select t1.*,t2.x,t2.z -from t1 inner join t2 on t1.a = t2.x and t1.b = t2.y +from t1 right join t2 on t1.a = t2.x and t1.b = t2.y Perhaps this change is left over from some previous version of the patch? [1] https://commitfest.postgresql.org/27/1741/ v5-0001-Introduce-UniqueKeys-to-determine-RelOptInfo-uniq.patch Description: Binary data v5-0002-Skip-DISTINCT-GROUP-BY-if-input-is-already-unique.patch Description: Binary data
Re: index paths and enable_indexscan
On Tue, Apr 14, 2020 at 4:58 PM Amit Langote wrote: > On Tue, Apr 14, 2020 at 5:29 PM Andy Fan wrote: > > On Tue, Apr 14, 2020 at 3:40 PM Amit Langote > wrote: > >> On Tue, Apr 14, 2020 at 4:13 PM Richard Guo > wrote: > >> > On Tue, Apr 14, 2020 at 2:44 PM Amit Langote > wrote: > >> >> Maybe I am missing something obvious, but is it intentional that > >> >> enable_indexscan is checked by cost_index(), that is, *after* > creating > >> >> an index path? I was expecting that if enable_indexscan is off, then > >> >> no index paths would be generated to begin with, because I thought > >> >> they are optional. > >> > > >> > I think the cost estimate of index paths is the same as other paths on > >> > that setting enable_xxx to off only adds a penalty factor > (disable_cost) > >> > to the path's cost. The path would be still generated and compete with > >> > other paths in add_path(). > >> > >> Yeah, but I am asking why build the path to begin with, as there will > >> always be seq scan path for base rels. > > > > I guess that is because user may disable seqscan as well. If so, we > > still need formula to decide with one to use, which requires index path > > has to be calculated. but since disabling the two at the same time is > rare, > > we can ignore the index path build if user allow seqscan > > I am saying that instead of building index path with disabled cost, > just don't build it at all. A base rel will always have a sequetial > path, even though with disabled cost if enable_seqscan = off. > Let's say user set enable_seqscan=off and set enable_indexscan=off; will you expect user to get seqscan at last? If so, why is seqscan (rather than index scan) since both are disabled by user equally? > Amit Langote > EnterpriseDB: http://www.enterprisedb.com >
Re: Index Skip Scan
On Wed, 8 Apr 2020 at 07:40, Dmitry Dolgov <9erthali...@gmail.com> wrote: > Other than that to summarize current open points for future readers > (this thread somehow became quite big): > > * Making UniqueKeys usage more generic to allow using skip scan for more > use cases (hopefully it was covered by the v33, but I still need a > confirmation from David, like blinking twice or something). I've not yet looked at the latest patch, but I did put some thoughts into an email on the other thread that's been discussing UniqueKeys [1]. I'm keen to hear thoughts on the plan I mentioned over there. Likely it would be best to discuss the specifics of what additional features we need to add to UniqueKeys for skip scans over here, but discuss any changes which affect both patches over there. We certainly can't have two separate implementations of UniqueKeys, so I believe the skip scans UniqueKeys patch should most likely be based on the one in [1] or some descendant of it. [1] https://www.postgresql.org/message-id/CAApHDvpx1qED1uLqubcKJ=ohatcmd7ptukkdq0b72_08nbr...@mail.gmail.com
Re: index paths and enable_indexscan
On Tue, Apr 14, 2020 at 5:12 PM Andy Fan wrote: > > > On Tue, Apr 14, 2020 at 4:58 PM Amit Langote > wrote: > >> On Tue, Apr 14, 2020 at 5:29 PM Andy Fan >> wrote: >> > On Tue, Apr 14, 2020 at 3:40 PM Amit Langote >> wrote: >> >> On Tue, Apr 14, 2020 at 4:13 PM Richard Guo >> wrote: >> >> > On Tue, Apr 14, 2020 at 2:44 PM Amit Langote < >> amitlangot...@gmail.com> wrote: >> >> >> Maybe I am missing something obvious, but is it intentional that >> >> >> enable_indexscan is checked by cost_index(), that is, *after* >> creating >> >> >> an index path? I was expecting that if enable_indexscan is off, >> then >> >> >> no index paths would be generated to begin with, because I thought >> >> >> they are optional. >> >> > >> >> > I think the cost estimate of index paths is the same as other paths >> on >> >> > that setting enable_xxx to off only adds a penalty factor >> (disable_cost) >> >> > to the path's cost. The path would be still generated and compete >> with >> >> > other paths in add_path(). >> >> >> >> Yeah, but I am asking why build the path to begin with, as there will >> >> always be seq scan path for base rels. >> > >> > I guess that is because user may disable seqscan as well. If so, we >> > still need formula to decide with one to use, which requires index path >> > has to be calculated. but since disabling the two at the same time is >> rare, >> > we can ignore the index path build if user allow seqscan >> >> I am saying that instead of building index path with disabled cost, >> just don't build it at all. A base rel will always have a sequetial >> path, even though with disabled cost if enable_seqscan = off. >> > > Let's say user set enable_seqscan=off and set enable_indexscan=off; > will you expect user to get seqscan at last? If so, why is seqscan > (rather than index scan) since both are disabled by user equally? > > The following test should demonstrate what I think. demo=# create table t(a int); CREATE TABLE demo=# insert into t select generate_series(1, 1000); INSERT 0 1000 demo=# create index t_a on t(a); CREATE INDEX demo=# analyze t; ANALYZE demo=# set enable_seqscan to off; SET demo=# set enable_indexscan to off; SET demo=# set enable_bitmapscan to off; SET demo=# set enable_indexonlyscan to off; SET demo=# explain select * from t where a = 1; QUERY PLAN - Index Scan using t_a on t (cost=100.43..108.45 rows=1 width=4) Index Cond: (a = 1) (2 rows) If we just disable index path, we will get seqscan at last. Regards Andy Fan >> Amit Langote >> EnterpriseDB: http://www.enterprisedb.com >> >
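[Editor's note: a small standalone C sketch, with invented costs, of the behavioral difference Andy's demo shows. Under today's penalty approach, disabling both GUCs still yields the intrinsically cheaper index scan; skipping index-path generation entirely would instead leave only the (penalized) seqscan.]

    #include <stdio.h>
    #include <stdbool.h>

    #define DISABLE_COST 1.0e10

    /* Invented per-path base costs for a selective predicate. */
    #define SEQSCAN_COST   450.0
    #define INDEXSCAN_COST 8.5

    /* Returns the name of the winning path under the given policy. */
    static const char *
    plan(bool enable_seqscan, bool enable_indexscan, bool skip_disabled_paths)
    {
        double      seq_cost = SEQSCAN_COST + (enable_seqscan ? 0 : DISABLE_COST);
        double      idx_cost = INDEXSCAN_COST + (enable_indexscan ? 0 : DISABLE_COST);
        bool        have_idx = enable_indexscan || !skip_disabled_paths;

        /* The seqscan path always exists for a base rel. */
        if (have_idx && idx_cost < seq_cost)
            return "indexscan";
        return "seqscan";
    }

    int
    main(void)
    {
        /* Both GUCs off, current behavior: index path built but penalized. */
        printf("penalize: %s\n", plan(false, false, false)); /* indexscan */
        /* Both GUCs off, proposed behavior: index path never built. */
        printf("skip    : %s\n", plan(false, false, true));  /* seqscan */
        return 0;
    }

In other words, the design question in this subthread is whether the "both disabled" tie-break on intrinsic cost is behavior worth preserving, or whether skipping generation (and its planning overhead) is acceptable even though it changes the chosen plan in that corner case.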
Re: Display of buffers for planning time show nothing for second run
út 14. 4. 2020 v 10:40 odesílatel Amit Langote napsal: > On Tue, Apr 14, 2020 at 5:27 PM Julien Rouhaud wrote: > > On Tue, Apr 14, 2020 at 10:18 AM Pavel Stehule > wrote: > > > For second run I get > > > > > > postgres=# EXPLAIN (BUFFERS, ANALYZE) SELECT * FROM obce WHERE > okres_id = 'CZ0201'; > > > > ┌──┐ > > > │ QUERY PLAN > │ > > > > ╞══╡ > > > │ Index Scan using obce_okres_id_idx on obce (cost=0.28..14.49 > rows=114 width=41) (actual time=0.044..0.101 rows=114 loops=1) │ > > > │ Index Cond: ((okres_id)::text = 'CZ0201'::text) > │ > > > │ Buffers: shared hit=4 > │ > > > │ Planning Time: 0.159 ms > │ > > > │ Execution Time: 0.155 ms > │ > > > > └──┘ > > > (5 rows) > > > > > > Now, there is not any touch in planning time. Does it mean so this all > these data are cached somewhere in session memory? > > > > The planning time is definitely shorter the 2nd time. And yes, what > > you see are all the catcache accesses that are initially performed on > > a fresh new backend. > > By the way, even with all catcaches served from local memory, one may > still see shared buffers being hit during planning. For example: > > explain (buffers, analyze) select * from foo where a = 1; > QUERY PLAN > > --- > Index Only Scan using foo_pkey on foo (cost=0.15..8.17 rows=1 > width=4) (actual time=0.010..0.011 rows=0 loops=1) >Index Cond: (a = 1) >Heap Fetches: 0 >Buffers: shared hit=2 > Planning Time: 0.775 ms >Buffers: shared hit=72 > Execution Time: 0.086 ms > (7 rows) > > Time: 2.477 ms > postgres=# explain (buffers, analyze) select * from foo where a = 1; > QUERY PLAN > > --- > Index Only Scan using foo_pkey on foo (cost=0.15..8.17 rows=1 > width=4) (actual time=0.012..0.012 rows=0 loops=1) >Index Cond: (a = 1) >Heap Fetches: 0 >Buffers: shared hit=2 > Planning Time: 0.102 ms >Buffers: shared hit=1 > Execution Time: 0.047 ms > (7 rows) > > It seems that 1 Buffer hit comes from get_relation_info() doing > _bt_getrootheight() for that index on foo. > unfortunatelly, I cannot to repeat it. create table foo(a int); create index on foo(a); insert into foo values(1); analyze foo; for this case any second EXPLAIN is without buffer on my comp > -- > > Amit Langote > EnterpriseDB: http://www.enterprisedb.com >
Re: Display of buffers for planning time show nothing for second run
út 14. 4. 2020 v 10:49 odesílatel Julien Rouhaud napsal: > On Tue, Apr 14, 2020 at 10:36 AM Pavel Stehule > wrote: > > > > út 14. 4. 2020 v 10:27 odesílatel Julien Rouhaud > napsal: > >> > >> Hi, > >> > >> On Tue, Apr 14, 2020 at 10:18 AM Pavel Stehule > wrote: > >> > > >> > Hi > >> > > >> > I am testing some features from Postgres 13, and I am not sure if I > understand well to behave of EXPLAIN(ANALYZE, BUFFERS) > >> > > >> > When I run following statement first time in session I get > >> > > >> > postgres=# EXPLAIN (BUFFERS, ANALYZE) SELECT * FROM obce WHERE > okres_id = 'CZ0201'; > >> > > ┌──┐ > >> > │ QUERY > PLAN │ > >> > > ╞══╡ > >> > │ Index Scan using obce_okres_id_idx on obce (cost=0.28..14.49 > rows=114 width=41) (actual time=0.072..0.168 rows=114 loops=1) │ > >> > │ Index Cond: ((okres_id)::text = 'CZ0201'::text) > │ > >> > │ Buffers: shared hit=4 > │ > >> > │ Planning Time: 0.539 ms > │ > >> > │ Buffers: shared hit=13 > │ > >> > │ Execution Time: 0.287 ms > │ > >> > > └──┘ > >> > (6 rows) > >> > > >> > And I see share hit 13 in planning time. > >> > > >> > For second run I get > >> > > >> > postgres=# EXPLAIN (BUFFERS, ANALYZE) SELECT * FROM obce WHERE > okres_id = 'CZ0201'; > >> > > ┌──┐ > >> > │ QUERY > PLAN │ > >> > > ╞══╡ > >> > │ Index Scan using obce_okres_id_idx on obce (cost=0.28..14.49 > rows=114 width=41) (actual time=0.044..0.101 rows=114 loops=1) │ > >> > │ Index Cond: ((okres_id)::text = 'CZ0201'::text) > │ > >> > │ Buffers: shared hit=4 > │ > >> > │ Planning Time: 0.159 ms > │ > >> > │ Execution Time: 0.155 ms > │ > >> > > └──┘ > >> > (5 rows) > >> > > >> > Now, there is not any touch in planning time. Does it mean so this > all these data are cached somewhere in session memory? > >> > >> The planning time is definitely shorter the 2nd time. And yes, what > >> you see are all the catcache accesses that are initially performed on > >> a fresh new backend. > > > > > > One time Tom Lane mentioned using index in planning time for getting > minimum and maximum. I expected so these values are not cached. But I > cannot to reproduce it, and then I am little bit surprised so I don't see > any hit in second, and other executions. > > Isn't that get_actual_variable_range() purpose? If you use a plan > that hit this function you'll definitely see consistent buffer usage > during planning: > > rjuju=# explain (buffers, analyze) select * from pg_class c join > pg_attribute a on a.attrelid = c.oid; > QUERY PLAN > > --- > Hash Join (cost=21.68..110.91 rows=2863 width=504) (actual > time=0.393..5.989 rows=2863 loops=1) >Hash Cond: (a.attrelid = c.oid) >Buffers: shared hit=40 read=29 >-> Seq Scan on pg_attribute a (cost=0.00..81.63 rows=2863 > width=239) (actual time=0.010..0.773 rows=2863 loops=1) > Buffers: shared hit=28 read=25 >-> Hash (cost=16.86..16.86 rows=386 width=265) (actual > time=0.333..0.334 rows=386 loops=1) > Buckets: 1024 Batches: 1 Memory Usage: 85kB > Buffers: shared hit=9 read=4 > -> Seq Scan on pg_class c (cost=0.00..16.86 rows=386 > width=265) (actual time=0.004..0.123 rows=386 loops=1) >Buffers: shared hit=9 read=4 > Planning Time: 2.709 ms >Buffers: shared hit=225 read=33 > Execution Time: 6.529 ms > (13 rows) > > rjuju=# explain (buffers, ana
Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions
On Mon, Apr 13, 2020 at 6:12 PM Kuntal Ghosh wrote: > > On Mon, Apr 13, 2020 at 5:20 PM Dilip Kumar wrote: > > On Mon, Apr 13, 2020 at 4:14 PM Kuntal Ghosh > > wrote: > > > > > > +#define SizeOfTransactionId (sizeof(TransactionId) + sizeof(char)) > > > This looks wrong. We should change the name of this Macro or we can > > > add the 1 byte directly in HEADER_SCRATCH_SIZE and some comments. > > > > I think this is in sync with below code (SizeOfXlogOrigin), SO doen't > > make much sense to add different terminology no? > > #define SizeOfXlogOrigin (sizeof(RepOriginId) + sizeof(char)) > > +#define SizeOfTransactionId (sizeof(TransactionId) + sizeof(char)) > > > In that case, we can rename this, for example, SizeOfXLogTransactionId. > > Some review comments from 0002-Issue-individual-*.path, > > +void > +ReorderBufferAddInvalidation(ReorderBuffer *rb, TransactionId xid, > + XLogRecPtr lsn, int nmsgs, > + SharedInvalidationMessage *msgs) > +{ > + MemoryContext oldcontext; > + ReorderBufferChange *change; > + > + /* XXX Should we even write invalidations without valid XID? */ > + if (xid == InvalidTransactionId) > + return; > + > + Assert(xid != InvalidTransactionId); > > It seems we don't call the function if xid is not valid. In fact, > You have a valid point. Also, it is not clear if we are first checking (xid == InvalidTransactionId) and returning from the function, how can even Assert hit. > @@ -281,6 +281,24 @@ DecodeXactOp(LogicalDecodingContext *ctx, > XLogRecordBuffer *buf) > } > case XLOG_XACT_ASSIGNMENT: > break; > + case XLOG_XACT_INVALIDATIONS: > + { > + TransactionId xid; > + xl_xact_invalidations *invals; > + > + xid = XLogRecGetXid(r); > + invals = (xl_xact_invalidations *) XLogRecGetData(r); > + > + if (!TransactionIdIsValid(xid)) > + break; > + > + ReorderBufferAddInvalidation(reorder, xid, buf->origptr, > + invals->nmsgs, invals->msgs); > > Why should we insert an WAL record for such cases? > Right, if there is any such case, we should avoid it. One more point about this patch, the commit message needs to be updated: > The new invalidations are written to WAL immediately, without any such caching. Perhaps it would be possible to add similar caching, > e.g. at the command level, or something like that? I think the above part of commit message is not right as the patch already does such a caching now at the command level. -- With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com
Re: Display of buffers for planning time show nothing for second run
On Tue, Apr 14, 2020 at 11:25 AM Pavel Stehule wrote: > > út 14. 4. 2020 v 10:40 odesílatel Amit Langote > napsal: >> >> On Tue, Apr 14, 2020 at 5:27 PM Julien Rouhaud wrote: >> > On Tue, Apr 14, 2020 at 10:18 AM Pavel Stehule >> > wrote: >> > > For second run I get >> > > >> > > postgres=# EXPLAIN (BUFFERS, ANALYZE) SELECT * FROM obce WHERE okres_id >> > > = 'CZ0201'; >> > > ┌──┐ >> > > │ QUERY PLAN >> > > │ >> > > ╞══╡ >> > > │ Index Scan using obce_okres_id_idx on obce (cost=0.28..14.49 rows=114 >> > > width=41) (actual time=0.044..0.101 rows=114 loops=1) │ >> > > │ Index Cond: ((okres_id)::text = 'CZ0201'::text) >> > > │ >> > > │ Buffers: shared hit=4 >> > > │ >> > > │ Planning Time: 0.159 ms >> > > │ >> > > │ Execution Time: 0.155 ms >> > > │ >> > > └──┘ >> > > (5 rows) >> > > >> > > Now, there is not any touch in planning time. Does it mean so this all >> > > these data are cached somewhere in session memory? >> > >> > The planning time is definitely shorter the 2nd time. And yes, what >> > you see are all the catcache accesses that are initially performed on >> > a fresh new backend. >> >> By the way, even with all catcaches served from local memory, one may >> still see shared buffers being hit during planning. For example: >> >> explain (buffers, analyze) select * from foo where a = 1; >> QUERY PLAN >> --- >> Index Only Scan using foo_pkey on foo (cost=0.15..8.17 rows=1 >> width=4) (actual time=0.010..0.011 rows=0 loops=1) >>Index Cond: (a = 1) >>Heap Fetches: 0 >>Buffers: shared hit=2 >> Planning Time: 0.775 ms >>Buffers: shared hit=72 >> Execution Time: 0.086 ms >> (7 rows) >> >> Time: 2.477 ms >> postgres=# explain (buffers, analyze) select * from foo where a = 1; >> QUERY PLAN >> --- >> Index Only Scan using foo_pkey on foo (cost=0.15..8.17 rows=1 >> width=4) (actual time=0.012..0.012 rows=0 loops=1) >>Index Cond: (a = 1) >>Heap Fetches: 0 >>Buffers: shared hit=2 >> Planning Time: 0.102 ms >>Buffers: shared hit=1 >> Execution Time: 0.047 ms >> (7 rows) >> >> It seems that 1 Buffer hit comes from get_relation_info() doing >> _bt_getrootheight() for that index on foo. > > > unfortunatelly, I cannot to repeat it. > > create table foo(a int); > create index on foo(a); > insert into foo values(1); > analyze foo; > > for this case any second EXPLAIN is without buffer on my comp _bt_getrootheight() won't cache any value if the index is totally empty. Removing the INSERT in your example should lead to Amit's behavior.
Re: Race condition in SyncRepGetSyncStandbysPriority
At Tue, 14 Apr 2020 13:06:14 +0900, Masahiko Sawada wrote in > On Tue, 14 Apr 2020 at 10:34, Tom Lane wrote: > > > > Kyotaro Horiguchi writes: > > > At Sat, 11 Apr 2020 18:30:30 -0400, Tom Lane wrote in > > >> What I think we should do about this is, essentially, to get rid of > > >> SyncRepGetSyncStandbys. Instead, let's have each walsender advertise > > >> whether *it* believes that it is a sync standby, based on its last > > >> evaluation of the relevant GUCs. This would be a bool that it'd > > >> compute and set alongside sync_standby_priority. (Hm, maybe we'd not > > > > > Mmm.. SyncRepGetStandbyPriority returns the "priority" that a > > > walsender thinks it is at, among synchronous_standby_names. Then to > > > decide "I am a sync standby" we need to know how many walsenders with > > > higher priority are alive now. SyncRepGetSyncStandbyPriority does the > > > judgment now and suffers from the inconsistency of priority values. > > > > Yeah. After looking a bit closer, I think that the current definition > > of sync_standby_priority (that is, as the result of local evaluation > > of SyncRepGetStandbyPriority()) is OK. The problem is what we're doing > > with it. I suggest that what we should do in SyncRepGetSyncRecPtr() > > is make one sweep across the WalSnd array, collecting PID, > > sync_standby_priority, *and* the WAL pointers from each valid entry. > > Then examine that data and decide which WAL value we need, without assuming > > that the sync_standby_priority values are necessarily totally consistent. > > But in any case we must examine each entry just once while holding its > > mutex, not go back to it later expecting it to still be the same. SyncRepGetSyncStandbysPriority() runs while holding SyncRepLock, so the sync_standby_priority of any walsender cannot be changed while the function is scanning the walsenders. The issue is that we already have inconsistent walsender information before we enter the function. Thus how many times we scan the array doesn't make any difference. I think we need to do one of the following. A) prevent SyncRepGetSyncStandbysPriority from being entered while walsender priority is inconsistent. B) make SyncRepGetSyncStandbysPriority tolerant of priority inconsistency. C) protect the walsender priority array from being inconsistent. (B) is a band-aid. To achieve (A) we need a central controller of priority config handling. (C) is: > Can we have a similar approach of sync_standby_defined for > sync_standby_priority? That is, checkpionter is responsible for > changing sync_standby_priority of all walsenders when SIGHUP. That > way, all walsenders can see a consistent view of > sync_standby_priority. And when a walsender starts, it sets > sync_standby_priority by itself. The logic to decide who's a sync > standby doesn't change. SyncRepGetSyncRecPtr() gets all walsenders > having higher priority along with their WAL positions. Yeah, it works if we do that, but the problem with that way is that to determine the priority of walsenders, we need to know which walsenders are running. That is, when a new walsender arrives, the process needs to be aware of the arrival (or departure) right away and reassign the priorities of all walsenders again. If we accept sharing variable-length information among processes, sharing sync_standby_names or the parsed SyncRepConfigData among processes would work.
So how come we didn't detect this problem > > long ago? I searched the buildfarm logs for assertion failures in > > syncrep.c, looking back one year, and here's what I found: ... > > The line numbers vary in the back branches, but all of these crashes are > > at that same Assert. So (a) yes, this does happen in the back branches, > > but (b) some fairly recent change has made it a whole lot more probable. > > Neither syncrep.c nor 007_sync_rep.pl have changed much in some time, > > so whatever the change was was indirect. Curious. Is it just timing? > > Interesting. It's happening on certain animals, not all. Especially > tests with HEAD on sidewinder and curculio, which are NetBSD 7 and > OpenBSD 5.9 respectively, started to fail at a high rate since a > couple of days ago. Coundn't this align the timing of config reloading? (I didn't checked anything yet.) | commit 421685812290406daea58b78dfab0346eb683bbb | Author: Noah Misch | Date: Sat Apr 11 10:30:00 2020 -0700 | | When WalSndCaughtUp, sleep only in WalSndWaitForWal(). |Before sleeping, WalSndWaitForWal() sends a keepalive if MyWalSnd->write |< sentPtr. That is important in logical replication. When the latest |physical LSN yields no logical replication messages (a common case), |that keepalive elicits a reply, and processing the reply updates |pg_stat_replication.replay_lsn. WalSndLoop() lacks that; wh
Re: Display of buffers for planning time show nothing for second run
út 14. 4. 2020 v 11:35 odesílatel Julien Rouhaud napsal: > On Tue, Apr 14, 2020 at 11:25 AM Pavel Stehule > wrote: > > > > út 14. 4. 2020 v 10:40 odesílatel Amit Langote > napsal: > >> > >> On Tue, Apr 14, 2020 at 5:27 PM Julien Rouhaud > wrote: > >> > On Tue, Apr 14, 2020 at 10:18 AM Pavel Stehule < > pavel.steh...@gmail.com> wrote: > >> > > For second run I get > >> > > > >> > > postgres=# EXPLAIN (BUFFERS, ANALYZE) SELECT * FROM obce WHERE > okres_id = 'CZ0201'; > >> > > > ┌──┐ > >> > > │ QUERY > PLAN │ > >> > > > ╞══╡ > >> > > │ Index Scan using obce_okres_id_idx on obce (cost=0.28..14.49 > rows=114 width=41) (actual time=0.044..0.101 rows=114 loops=1) │ > >> > > │ Index Cond: ((okres_id)::text = 'CZ0201'::text) > │ > >> > > │ Buffers: shared hit=4 > │ > >> > > │ Planning Time: 0.159 ms > │ > >> > > │ Execution Time: 0.155 ms >│ > >> > > > └──┘ > >> > > (5 rows) > >> > > > >> > > Now, there is not any touch in planning time. Does it mean so this > all these data are cached somewhere in session memory? > >> > > >> > The planning time is definitely shorter the 2nd time. And yes, what > >> > you see are all the catcache accesses that are initially performed on > >> > a fresh new backend. > >> > >> By the way, even with all catcaches served from local memory, one may > >> still see shared buffers being hit during planning. For example: > >> > >> explain (buffers, analyze) select * from foo where a = 1; > >> QUERY PLAN > >> > --- > >> Index Only Scan using foo_pkey on foo (cost=0.15..8.17 rows=1 > >> width=4) (actual time=0.010..0.011 rows=0 loops=1) > >>Index Cond: (a = 1) > >>Heap Fetches: 0 > >>Buffers: shared hit=2 > >> Planning Time: 0.775 ms > >>Buffers: shared hit=72 > >> Execution Time: 0.086 ms > >> (7 rows) > >> > >> Time: 2.477 ms > >> postgres=# explain (buffers, analyze) select * from foo where a = 1; > >> QUERY PLAN > >> > --- > >> Index Only Scan using foo_pkey on foo (cost=0.15..8.17 rows=1 > >> width=4) (actual time=0.012..0.012 rows=0 loops=1) > >>Index Cond: (a = 1) > >>Heap Fetches: 0 > >>Buffers: shared hit=2 > >> Planning Time: 0.102 ms > >>Buffers: shared hit=1 > >> Execution Time: 0.047 ms > >> (7 rows) > >> > >> It seems that 1 Buffer hit comes from get_relation_info() doing > >> _bt_getrootheight() for that index on foo. > > > > > > unfortunatelly, I cannot to repeat it. > > > > create table foo(a int); > > create index on foo(a); > > insert into foo values(1); > > analyze foo; > > > > for this case any second EXPLAIN is without buffer on my comp > > _bt_getrootheight() won't cache any value if the index is totally > empty. Removing the INSERT in your example should lead to Amit's > behavior. > aha. good to know it. Thank you Pavel
Re: [HACKERS] make async slave to wait for lsn to be replayed
On 2020-04-11 00:44, Andres Freund wrote: I think there's also some advantages of having it in a single statement for poolers. If a pooler analyzes BEGIN AFTER 'xx/xx' it could e.g. redirect the transaction to a node that's caught up far enough, instead of blocking. But that can't work even close to as easily if it's something that has to be executed before transaction begin. I think that's a good point. Also, I'm not sure how we'd expect a wait-for-LSN procedure to work inside a single-snapshot transaction. Would it throw an error inside a RR transaction block? Would it give a warning? -- Anna Akenteva Postgres Professional: http://www.postgrespro.com The Russian Postgres Company
Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions
On Tue, Apr 14, 2020 at 2:57 PM Amit Kapila wrote: > > On Mon, Apr 13, 2020 at 6:12 PM Kuntal Ghosh > wrote: > > > > On Mon, Apr 13, 2020 at 5:20 PM Dilip Kumar wrote: > > > On Mon, Apr 13, 2020 at 4:14 PM Kuntal Ghosh > > > wrote: > > > > > > > > +#define SizeOfTransactionId (sizeof(TransactionId) + sizeof(char)) > > > > This looks wrong. We should change the name of this Macro or we can > > > > add the 1 byte directly in HEADER_SCRATCH_SIZE and some comments. > > > > > > I think this is in sync with below code (SizeOfXlogOrigin), SO doen't > > > make much sense to add different terminology no? > > > #define SizeOfXlogOrigin (sizeof(RepOriginId) + sizeof(char)) > > > +#define SizeOfTransactionId (sizeof(TransactionId) + sizeof(char)) > > > > > In that case, we can rename this, for example, SizeOfXLogTransactionId. > > > > Some review comments from 0002-Issue-individual-*.path, > > > > +void > > +ReorderBufferAddInvalidation(ReorderBuffer *rb, TransactionId xid, > > + XLogRecPtr lsn, int nmsgs, > > + SharedInvalidationMessage *msgs) > > +{ > > + MemoryContext oldcontext; > > + ReorderBufferChange *change; > > + > > + /* XXX Should we even write invalidations without valid XID? */ > > + if (xid == InvalidTransactionId) > > + return; > > + > > + Assert(xid != InvalidTransactionId); > > > > It seems we don't call the function if xid is not valid. In fact, > > > > You have a valid point. Also, it is not clear if we are first > checking (xid == InvalidTransactionId) and returning from the > function, how can even Assert hit. I have changed to code, now we only have an assert. > > > @@ -281,6 +281,24 @@ DecodeXactOp(LogicalDecodingContext *ctx, > > XLogRecordBuffer *buf) > > } > > case XLOG_XACT_ASSIGNMENT: > > break; > > + case XLOG_XACT_INVALIDATIONS: > > + { > > + TransactionId xid; > > + xl_xact_invalidations *invals; > > + > > + xid = XLogRecGetXid(r); > > + invals = (xl_xact_invalidations *) XLogRecGetData(r); > > + > > + if (!TransactionIdIsValid(xid)) > > + break; > > + > > + ReorderBufferAddInvalidation(reorder, xid, buf->origptr, > > + invals->nmsgs, invals->msgs); > > > > Why should we insert an WAL record for such cases? > > > > Right, if there is any such case, we should avoid it. I think we don't have any such case because we are logging at the command end. So I have created an assert instead of the check. > One more point about this patch, the commit message needs to be updated: > > > The new invalidations are written to WAL immediately, without any > such caching. Perhaps it would be possible to add similar caching, > > e.g. at the command level, or something like that? > > I think the above part of commit message is not right as the patch > already does such a caching now at the command level. Right, I have removed that. -- Regards, Dilip Kumar EnterpriseDB: http://www.enterprisedb.com
Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions
On Tue, Apr 14, 2020 at 3:41 PM Dilip Kumar wrote: > > > > > > > @@ -281,6 +281,24 @@ DecodeXactOp(LogicalDecodingContext *ctx, > > > XLogRecordBuffer *buf) > > > } > > > case XLOG_XACT_ASSIGNMENT: > > > break; > > > + case XLOG_XACT_INVALIDATIONS: > > > + { > > > + TransactionId xid; > > > + xl_xact_invalidations *invals; > > > + > > > + xid = XLogRecGetXid(r); > > > + invals = (xl_xact_invalidations *) XLogRecGetData(r); > > > + > > > + if (!TransactionIdIsValid(xid)) > > > + break; > > > + > > > + ReorderBufferAddInvalidation(reorder, xid, buf->origptr, > > > + invals->nmsgs, invals->msgs); > > > > > > Why should we insert an WAL record for such cases? > > > > > > > Right, if there is any such case, we should avoid it. > > I think we don't have any such case because we are logging at the > command end. So I have created an assert instead of the check. > Have you tried to ensure this in some way? One idea could be to add an Assert (to check if transaction id is assigned) in the new code where you are writing WAL for this action and then run make check-world and or make installcheck-world. -- With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com
Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions
On Tue, Apr 14, 2020 at 3:57 PM Amit Kapila wrote: > > On Tue, Apr 14, 2020 at 3:41 PM Dilip Kumar wrote: > > > > > > > > > > > @@ -281,6 +281,24 @@ DecodeXactOp(LogicalDecodingContext *ctx, > > > > XLogRecordBuffer *buf) > > > > } > > > > case XLOG_XACT_ASSIGNMENT: > > > > break; > > > > + case XLOG_XACT_INVALIDATIONS: > > > > + { > > > > + TransactionId xid; > > > > + xl_xact_invalidations *invals; > > > > + > > > > + xid = XLogRecGetXid(r); > > > > + invals = (xl_xact_invalidations *) XLogRecGetData(r); > > > > + > > > > + if (!TransactionIdIsValid(xid)) > > > > + break; > > > > + > > > > + ReorderBufferAddInvalidation(reorder, xid, buf->origptr, > > > > + invals->nmsgs, invals->msgs); > > > > > > > > Why should we insert an WAL record for such cases? > > > > > > > > > > Right, if there is any such case, we should avoid it. > > > > I think we don't have any such case because we are logging at the > > command end. So I have created an assert instead of the check. > > > > Have you tried to ensure this in some way? One idea could be to add > an Assert (to check if transaction id is assigned) in the new code > where you are writing WAL for this action and then run make > check-world and or make installcheck-world. Yeah, I had already tested that. -- Regards, Dilip Kumar EnterpriseDB: http://www.enterprisedb.com
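[Editor's note: a tiny standalone C sketch of the convention settled on above, using the standard assert() as a stand-in for the backend's Assert() and an invented stand-in TransactionId type. Rather than silently returning on an invalid xid (which made the old assertion unreachable), the function treats it as a caller bug, which is what running make check-world with the assertion in place is meant to validate.]

    #include <assert.h>
    #include <stdint.h>
    #include <stdio.h>

    typedef uint32_t TransactionId;   /* stand-in for the PostgreSQL type */
    #define InvalidTransactionId ((TransactionId) 0)
    #define TransactionIdIsValid(xid) ((xid) != InvalidTransactionId)

    /* Stand-in for ReorderBufferAddInvalidation(): an unassigned xid is
     * treated as a programming error, not silently ignored. */
    static void
    add_invalidation(TransactionId xid, int nmsgs)
    {
        assert(TransactionIdIsValid(xid));    /* Assert() in backend code */
        printf("queue %d invalidation message(s) for xid %u\n", nmsgs, xid);
    }

    int
    main(void)
    {
        add_invalidation(128, 3);             /* fine */
        /* add_invalidation(InvalidTransactionId, 1);  would trip the assert */
        return 0;
    }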
Re: Poll: are people okay with function/operator table redesign?
On 4/13/20 7:55 PM, Tom Lane wrote: > "Jonathan S. Katz" writes: >> On 4/13/20 7:02 PM, Jonathan S. Katz wrote: >>> Perhaps a counterproposal: We eliminate the content in the leftmost >>> "function column, but leave that there to allow the function name / >>> signature to span the full 3 columns. Then the rest of the info goes >>> below. This will also compress the table height down a bit. >> An attempt at a "POC" of what I'm describing (attached image). > Hmm ... what is determining the width of the left-hand column? > It doesn't seem to have any content, since the function entries > are being spanned across the whole table. > > I think the main practical problem though is that it wouldn't > work nicely for operators, since the key "name" you'd be looking > for would not be at the left of the signature line. I suppose we > don't necessarily have to have the same layout for operators as > for functions, but it feels like it'd be jarringly inconsistent. > > Maybe highlight the item by bolding or colour? cheers andrew -- Andrew Dunstanhttps://www.2ndQuadrant.com PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Re: WIP/PoC for parallel backup
Hi Asif Getting the following error on Parallel backup when --no-manifest option is used. [edb@localhost bin]$ [edb@localhost bin]$ [edb@localhost bin]$ ./pg_basebackup -v -j 5 -D /home/edb/Desktop/backup/ --no-manifest pg_basebackup: initiating base backup, waiting for checkpoint to complete pg_basebackup: checkpoint completed pg_basebackup: write-ahead log start point: 0/228 on timeline 1 pg_basebackup: starting background WAL receiver pg_basebackup: created temporary replication slot "pg_basebackup_10223" pg_basebackup: backup worker (0) created pg_basebackup: backup worker (1) created pg_basebackup: backup worker (2) created pg_basebackup: backup worker (3) created pg_basebackup: backup worker (4) created pg_basebackup: write-ahead log end point: 0/2000100 pg_basebackup: error: could not get data for 'BUILD_MANIFEST': ERROR: could not open file "base/pgsql_tmp/pgsql_tmp_b4ef5ac0fd150b2a28caf626bbb1bef2.1": No such file or directory pg_basebackup: removing contents of data directory "/home/edb/Desktop/backup/" [edb@localhost bin]$ Thanks On Tue, Apr 14, 2020 at 5:33 PM Asif Rehman wrote: > > > On Wed, Apr 8, 2020 at 6:53 PM Kashif Zeeshan < > kashif.zees...@enterprisedb.com> wrote: > >> >> >> On Tue, Apr 7, 2020 at 9:44 PM Asif Rehman >> wrote: >> >>> Hi, >>> >>> Thanks, Kashif and Rajkumar. I have fixed the reported issues. >>> >>> I have added the shared state as previously described. The new grammar >>> changes >>> are as follows: >>> >>> START_BACKUP [LABEL ''] [FAST] [MAX_RATE %d] >>> - This will generate a unique backupid using pg_strong_random(16) >>> and hex-encoded >>> it. which is then returned as the result set. >>> - It will also create a shared state and add it to the hashtable. >>> The hash table size is set >>> to BACKUP_HASH_SIZE=10, but since hashtable can expand >>> dynamically, I think it's >>> sufficient initial size. max_wal_senders is not used, because it >>> can be set to quite a >>> large values. >>> >>> JOIN_BACKUP 'backup_id' >>> - finds 'backup_id' in hashtable and attaches it to server process. >>> >>> >>> SEND_FILE '(' 'FILE' ')' [NOVERIFY_CHECKSUMS] >>> - renamed SEND_FILES to SEND_FILE >>> - removed START_WAL_LOCATION from this because 'startptr' is now >>> accessible through >>> shared state. >>> >>> There is no change in other commands: >>> STOP_BACKUP [NOWAIT] >>> LIST_TABLESPACES [PROGRESS] >>> LIST_FILES [TABLESPACE] >>> LIST_WAL_FILES [START_WAL_LOCATION 'X/X'] [END_WAL_LOCATION 'X/X'] >>> >>> The current patches (v11) have been rebased to the latest master. The >>> backup manifest is enabled >>> by default, so I have disabled it for parallel backup mode and have >>> generated a warning so that >>> user is aware of it and not expect it in the backup. >>> >>> Hi Asif >> >> I have verified the bug fixes, one bug is fixed and working now as >> expected >> >> For the verification of the other bug fixes faced following issues, >> please have a look. >> >> >> 1) Following bug fixes mentioned below are generating segmentation fault. >> >> Please note for reference I have added a description only as steps were >> given in previous emails of each bug I tried to verify the fix. Backtrace >> is also added with each case which points to one bug for both the cases. >> >> a) The backup failed with errors "error: could not connect to server: >> could not look up local user ID 1000: Too many open files" when the >> max_wal_senders was set to 2000. 
>> >> >> [edb@localhost bin]$ ./pg_basebackup -v -j 1990 -D >> /home/edb/Desktop/backup/ >> pg_basebackup: warning: backup manifest is disabled in parallel backup >> mode >> pg_basebackup: initiating base backup, waiting for checkpoint to complete >> pg_basebackup: checkpoint completed >> pg_basebackup: write-ahead log start point: 0/228 on timeline 1 >> pg_basebackup: starting background WAL receiver >> pg_basebackup: created temporary replication slot "pg_basebackup_9925" >> pg_basebackup: backup worker (0) created >> pg_basebackup: backup worker (1) created >> pg_basebackup: backup worker (2) created >> pg_basebackup: backup worker (3) created >> …. >> …. >> pg_basebackup: backup worker (1014) created >> pg_basebackup: backup worker (1015) created >> pg_basebackup: backup worker (1016) created >> pg_basebackup: backup worker (1017) created >> pg_basebackup: error: could not connect to server: could not look up >> local user ID 1000: Too many open files >> Segmentation fault >> [edb@localhost bin]$ >> >> >> [edb@localhost bin]$ >> [edb@localhost bin]$ gdb pg_basebackup >> /tmp/cores/core.pg_basebackup.13219.localhost.localdomain.1586349551 >> GNU gdb (GDB) Red Hat Enterprise Linux 7.6.1-115.el7 >> Copyright (C) 2013 Free Software Foundation, Inc. >> License GPLv3+: GNU GPL version 3 or later < >> http://gnu.org/licenses/gpl.html> >> This is free software: you are free to change and redistribute it. >> There is NO WARRANTY, to the extent permitted by law. Type "show co
While restoring - getting error if dump contains sql statements generated from generated.sql file
Hi , We have a sql file called 'generated.sql' under src/test/regress/sql folder . if we run this file on psql , take the dump and try to restore it on another db we are getting error like - psql:/tmp/x:434: ERROR: column "b" of relation "gtest1_1" is a generated column psql:/tmp/x:441: ERROR: cannot use column reference in DEFAULT expression These sql statements , i copied from the dump file postgres=# CREATE TABLE public.gtest30 ( postgres(# a integer, postgres(# b integer postgres(# ); CREATE TABLE postgres=# postgres=# CREATE TABLE public.gtest30_1 ( postgres(# ) postgres-# INHERITS (public.gtest30); CREATE TABLE postgres=# ALTER TABLE ONLY public.gtest30_1 ALTER COLUMN b SET DEFAULT (a * 2); ERROR: cannot use column reference in DEFAULT expression postgres=# Steps to reproduce - connect to psql - ( ./psql postgres) create database ( create database x;) connect to database x (\c x ) execute generated.sql file (\i ../../src/test/regress/sql/generated.sql) take the dump of x db (./pg_dump -Fp x > /tmp/t.dump) create another database (create database y;) Connect to y db (\c y) execute plain dump sql file (\i /tmp/t.dump) -- regards,tushar EnterpriseDB https://www.enterprisedb.com/ The Enterprise PostgreSQL Company
Re: Race condition in SyncRepGetSyncStandbysPriority
Kyotaro Horiguchi writes: > SyncRepGetSyncStandbysPriority() is runing holding SyncRepLock so > sync_standby_priority of any walsender can be changed while the > function is scanning welsenders. The issue is we already have > inconsistent walsender information before we enter the function. Thus > how many times we scan on the array doesn't make any difference. *Yes it does*. The existing code can deliver entirely broken results if some walsender exits between where we examine the priorities and where we fetch the WAL pointers. While that doesn't seem to be the exact issue we're seeing in the buildfarm, it's still another obvious bug in this code. I will not accept a "fix" that doesn't fix that. > I think we need to do one of the followings. > A) prevent SyncRepGetSyncStandbysPriority from being entered while > walsender priority is inconsistent. > B) make SyncRepGetSyncStandbysPriority be tolerant of priority > inconsistency. > C) protect walsender priority array from beinig inconsistent. (B) seems like the only practical solution from here. We could probably arrange for synchronous update of the priorities when they change in response to a GUC change, but it doesn't seem to me to be practical to do that in response to walsender exit. You'd end up finding that an unexpected walsender exit results in panic'ing the system, which is no better than where we are now. It doesn't seem to me to be that hard to implement the desired semantics for synchronous_standby_names with inconsistent info. In FIRST mode you basically just need to take the N smallest priorities you see in the array, but without assuming there are no duplicates or holes. It might be a good idea to include ties at the end, that is if you see 1,2,2,4 or 1,3,3,4 and you want 2 sync standbys, include the first three of them in the calculation until the inconsistency is resolved. In ANY mode I don't see that inconsistent priorities matter at all. > If we accept to share variable-length information among processes, > sharing sync_standby_names or parsed SyncRepConfigData among processes > would work. Not sure that we really need more than what's being shared now, ie each process's last-known index in the sync_standby_names list. regards, tom lane
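To make the FIRST-mode rule Tom describes concrete, here is a small standalone C sketch (illustrative only, not the walsender code; the array of priorities and the output labels are hypothetical) that picks the num_sync smallest priorities seen in a possibly inconsistent snapshot, tolerating duplicates and holes and keeping ties at the cutoff:

#include <stdbool.h>
#include <stdio.h>

/*
 * priorities[] is a hypothetical snapshot of sync_standby_priority values;
 * 0 means "not a sync candidate".  chosen[] is set true for every walsender
 * whose priority falls at or below the computed cutoff.
 */
static void
pick_sync_standbys(const int *priorities, int nwalsenders,
                   int num_sync, bool *chosen)
{
    int     cutoff = 0;
    int     taken = 0;

    /* raise the cutoff until at least num_sync candidates are covered */
    while (taken < num_sync)
    {
        int     best = 0;
        int     nbest = 0;

        for (int i = 0; i < nwalsenders; i++)
        {
            int     p = priorities[i];

            if (p > cutoff)
            {
                if (best == 0 || p < best)
                {
                    best = p;
                    nbest = 1;
                }
                else if (p == best)
                    nbest++;
            }
        }
        if (best == 0)
            break;              /* fewer candidates than num_sync: take all */
        cutoff = best;
        taken += nbest;
    }

    /* include everything at or below the cutoff, ties included */
    for (int i = 0; i < nwalsenders; i++)
        chosen[i] = (priorities[i] > 0 && priorities[i] <= cutoff);
}

int
main(void)
{
    int     prio[] = {1, 2, 2, 4};  /* the 1,2,2,4 example with num_sync = 2 */
    bool    chosen[4];

    pick_sync_standbys(prio, 4, 2, chosen);
    for (int i = 0; i < 4; i++)
        printf("walsender %d: %s\n", i, chosen[i] ? "sync" : "potential");
    return 0;
}

With {1,2,2,4} and num_sync = 2 this selects the first three walsenders, matching the "include ties at the end" behavior described above; in ANY mode no such cutoff logic is needed at all.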
Re: where should I stick that backup?
On Sun, Apr 12, 2020 at 8:27 PM Andres Freund wrote: > > That's quite appealing. One downside - IMHO significant - is that you > > have to have a separate process to do *anything*. If you want to add a > > filter that just logs everything it's asked to do, for example, you've > > gotta have a whole process for that, which likely adds a lot of > > overhead even if you can somehow avoid passing all the data through an > > extra set of pipes. The interface I proposed would allow you to inject > > very lightweight filters at very low cost. This design really doesn't. > > Well, in what you described it'd still be all done inside pg_basebackup, > or did I misunderstand? Once you fetched it from the server, I can't > imagine the overhead of filtering it a bit differently would matter. > > But even if, the "target" could just reply with "skip" or such, instead > of providing an fd. > > What kind of filtering are you thinking of where this is a problem? > Besides just logging the filenames? I just can't imagine how that's a > relevant overhead compared to having to do things like > 'shell ssh rhaas@depository pgfile create-exclusive - %f.lz4' Anything you want to do in the same process. I mean, right now we have basically one target (filesystem) and one filter (compression). Neither of those things spawn a process. It seems logical to imagine that there might be other things that are similar in the future. It seems to me that there are definitely things where you will want to spawn a process; that's why I like having shell commands as one option. But I don't think we should require that you can't have a filter or a target unless you also spawn a process for it. > I really think we want the option to eventually do this server-side. And > I don't quite see it as viable to go for an API that allows to specify > shell fragments that are going to be executed server side. The server-side thing is a good point, but I think it adds quite a bit of complexity, too. I'm worried that this is ballooning to an unworkable amount of complexity - and not just code complexity, but bikeshedding complexity, too. Like, I started with a command-line option that could probably have been implemented in a few hundred lines of code. Now, we're up to something where you have to build custom processes that speak a novel protocol and work on both the client and the server side. That's at least several thousand lines of code, maybe over ten thousand if the sample binaries that use the new protocol are more than just simple demonstrations of how to code to the interface. More importantly, it means agreeing on the nature of this custom protocol, which seems like something where I could put in a ton of effort to create something and then have somebody complain because it's not JSON, or because it is JSON, or because the capability negotiation system isn't right, or whatever. I'm not exactly saying that we shouldn't do it; I think it has some appeal. But I'd sure like to find some way of getting started that doesn't involve having to do everything in one patch, and then getting told to change it all again - possibly with different people wanting contradictory things. > > Note that you could build this on top of what I proposed, but not the > > other way around. > > Why should it not be possible the other way round? Because a C function call API lets you decide to spawn a process, but if the framework inherently spawns a process, you can't decide not to do so in a particular case. 
-- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: Poll: are people okay with function/operator table redesign?
On 4/13/20 7:13 PM, Tom Lane wrote: As discussed in the thread at [1], I've been working on redesigning the tables we use to present SQL functions and operators. The first installment of that is now up; see tables 9.30 and 9.31 at https://www.postgresql.org/docs/devel/functions-datetime.html and table 9.33 at https://www.postgresql.org/docs/devel/functions-enum.html Before I spend more time on this, I want to make sure that people are happy with this line of attack. Comparing these tables to the way they look in v12, they clearly take more vertical space; but at least to my eye they're less cluttered and more readable. They definitely scale a lot better for cases where a long function description is needed, or where we'd like to have more than one example. Does anyone prefer the old way, or have a better idea? I know that the table headings are a bit weirdly laid out; hopefully that can be resolved [2]. I prefer the old way since I find it very hard to see which fields belong to which function in the new way. I think what confuses my eyes is how some rows are split in half while others are not, especially for those functions where there is only one example output. I do not have any issue reading those with many example outputs. For the old tables I can at least just make the browser window ridiculously wide ro read them. Andreas
Re: index paths and enable_indexscan
On Tue, Apr 14, 2020 at 6:12 PM Andy Fan wrote: > On Tue, Apr 14, 2020 at 4:58 PM Amit Langote wrote: >> I am saying that instead of building index path with disabled cost, >> just don't build it at all. A base rel will always have a sequetial >> path, even though with disabled cost if enable_seqscan = off. > > Let's say user set enable_seqscan=off and set enable_indexscan=off; > will you expect user to get seqscan at last? If so, why is seqscan > (rather than index scan) since both are disabled by user equally? I was really thinking of this in terms of planner effort, which for creating an index path is more than creating sequential path, although sure the payoff can be great. That is, I want the planner to avoid creating index paths *to save cycles*, but see no way of making that happen. I was thinking disabling enable_indexscan would do the trick. -- Amit Langote EnterpriseDB: http://www.enterprisedb.com
Re: index paths and enable_indexscan
Amit Langote writes: > I am saying that instead of building index path with disabled cost, > just don't build it at all. A base rel will always have a sequetial > path, even though with disabled cost if enable_seqscan = off. Awhile back I'd looked into getting rid of disable_cost altogether by dint of not generating disabled paths. It's harder than it sounds. We could perhaps change this particular case, but it's not clear that there's any real benefit of making this one change in isolation. Note that you can't just put a big OFF switch at the start of indxpath.c, because enable_indexscan and enable_bitmapscan are distinct switches, but the code to generate those path types is inextricably intertwined. Skipping individual paths further down on the basis of the appropriate switch would be fairly subtle and perhaps bug-prone. The existing implementation of those switches has the advantages of being trivially simple and clearly correct (for some value of "correct"). regards, tom lane
Re: Poll: are people okay with function/operator table redesign?
Andreas Karlsson writes: > For the old tables I can at least just make the browser window > ridiculously wide ro read them. A large part of the point here is to make the tables usable when you don't have that option, as for example in PDF output. Even with a wide window, though, some of our function tables are monstrously ugly. regards, tom lane
Re: index paths and enable_indexscan
Amit Langote writes: > I was really thinking of this in terms of planner effort, which for > creating an index path is more than creating sequential path, although > sure the payoff can be great. That is, I want the planner to avoid > creating index paths *to save cycles*, but see no way of making that > happen. I was thinking disabling enable_indexscan would do the trick. I think that's completely misguided, because in point of fact nobody is going to care about the planner's performance with enable_indexscan turned off. It's not an interesting production case. All of these enable_xxx switches exist just for debug purposes, and so the right way to think about them is "what's the simplest, least bug-prone, lowest-maintenance way to get the effect?". Likewise, I don't actually much care what results you get if you turn off *all* of them. It's not a useful case to spend our effort on. regards, tom lane
Re: Poll: are people okay with function/operator table redesign?
On 4/14/20 4:29 PM, Tom Lane wrote: Andreas Karlsson writes: For the old tables I can at least just make the browser window ridiculously wide ro read them. A large part of the point here is to make the tables usable when you don't have that option, as for example in PDF output. Even with a wide window, though, some of our function tables are monstrously ugly. Sure, but I wager the number of people using the HTML version of our documentation on laptops and desktop computers are the biggest group of users. That said, I agree with that quite many of our tables right now are ugly, but I prefer ugly to hard to read. For me the mix of having every third row split into two fields makes the tables very hard to read. I have a hard time seeing which rows belong to which function. Andreas
Re: Poll: are people okay with function/operator table redesign?
Andreas Karlsson writes: > That said, I agree with that quite many of our tables right now are > ugly, but I prefer ugly to hard to read. For me the mix of having every > third row split into two fields makes the tables very hard to read. I > have a hard time seeing which rows belong to which function. Did you look at the variants without that discussed downthread? regards, tom lane
Re: Poll: are people okay with function/operator table redesign?
On Mon, Apr 13, 2020 at 4:29 PM Tom Lane wrote: > I wouldn't be averse to dropping the text descriptions for operators > in places where they seem obvious ... but who decides what is obvious? Well, we do. We're smart, right? I don't think it's a good idea to add clutter to table A just because table B needs more details. What matters is whether table A needs more details. The v12 version of the "Table 9.30. Date/Time Operators" is not that wide, and is really quite clear. The new version takes 3 lines per operator where the old one took one. That's because you've added (1) a description of the fact that + does addition and - does subtraction, repeated for each operator, and (2) explicit information about the input and result types. I don't think either adds much, in this case. The former doesn't really need to be explained, and the latter was clear enough from the way the examples were presented - everything had explicit types. For more complicated cases, one thing we could do is ditch the table and use a <variablelist> with a separate <varlistentry> for each operator. So you could have something like: date + date → timestamp Lengthy elocution, including an example. But I would only advocate for this style in cases where there is substantial explaining to be done. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: where should I stick that backup?
Greetings, * Robert Haas (robertmh...@gmail.com) wrote: > On Sun, Apr 12, 2020 at 8:27 PM Andres Freund wrote: > > I really think we want the option to eventually do this server-side. And > > I don't quite see it as viable to go for an API that allows to specify > > shell fragments that are going to be executed server side. > > The server-side thing is a good point, but I think it adds quite a bit > of complexity, too. I'm worried that this is ballooning to an > unworkable amount of complexity - and not just code complexity, but > bikeshedding complexity, too. Like, I started with a command-line > option that could probably have been implemented in a few hundred > lines of code. Now, we're up to something where you have to build > custom processes that speak a novel protocol and work on both the > client and the server side. That's at least several thousand lines of > code, maybe over ten thousand if the sample binaries that use the new > protocol are more than just simple demonstrations of how to code to > the interface. More importantly, it means agreeing on the nature of > this custom protocol, which seems like something where I could put in > a ton of effort to create something and then have somebody complain > because it's not JSON, or because it is JSON, or because the > capability negotiation system isn't right, or whatever. I'm not > exactly saying that we shouldn't do it; I think it has some appeal. > But I'd sure like to find some way of getting started that doesn't > involve having to do everything in one patch, and then getting told to > change it all again - possibly with different people wanting > contradictory things. Doing things incrementally and not all in one patch absolutely makes a lot of sense and is a good idea. Wouldn't it make sense to, given that we have some idea of what we want it to eventually look like, to make progress in that direction though? That is- I tend to agree with Andres that having this supported server-side eventually is what we should be thinking about as an end-goal (what is the point of pg_basebackup in all of this, after all, if the goal is to get a backup of PG from the PG server to s3..? why go through some other program or through the replication protocol?) and having the server exec'ing out to run shell script fragments to make that happen looks like it would be really awkward and full of potential risks and issues and agreement that it wouldn't be a good fit. If, instead, we worked on a C-based interface which includes filters and storage drivers, and was implemented through libpgcommon, we could start with that being all done through pg_basebackup and work to hammer out the complications and issues that we run into there and, once it seems reasonably stable and works well, we could potentially pull that into the backend to be run directly without having to have pg_basebackup involved in the process. There's been good progress in the direction of having more done by the backend already, and that's thanks to you and it's good work- specifically that the backend now has the ability to generate a manifest, with checksums included as the backup is being run, which is definitely an important piece. Thanks, Stephen signature.asc Description: PGP signature
Re: Poll: are people okay with function/operator table redesign?
Robert Haas writes: > The v12 version of the "Table 9.30. Date/Time Operators" is not that > wide, and is really quite clear. Well, no it isn't. The main nit I would pick in that claim is that it's far from obvious that the three examples of float8 * interval are all talking about the same operator; in fact, a reader would be very likely to draw the false conclusion that there is an integer * interval operator. This is an aspect of the general problem that we don't have a nice way to deal with multiple examples in the tables. Somebody kluged their way around it here in this particular way, but I'd really like a clearer way, because we need more examples. I would also point out that this table is quite out of step with the rest of the docs in its practice of showing the results as though they were typed literals. Most places that show results just show what you'd expect to see in a psql output column, making it necessary to show the result data type somewhere else. > The new version takes 3 lines per > operator where the old one took one. That's because you've added (1) a > description of the fact that + does addition and - does subtraction, > repeated for each operator, and (2) explicit information about the > input and result types. I don't think either add much, in this case. As I already said, I agree about the text descriptions being of marginal value in this case. I disagree about the explicit datatypes, because the float8 * interval cases already show a hole in that argument, and surely we don't want to require every example to use explicitly-typed literals and nothing but. Besides, what will you do for operators that take anyarray or the like? > For more complicated cases, one thing we could do is ditch the table > and use a with a separate for each > operator. So you could have something like: > ... > But I would only advocate for this style in cases where there is > substantial explaining to be done. I'd like to have more consistency, not less. I do not think it helps readers to have each page in Chapter 9 have its own idiosyncratic way of presenting operators/functions. The operator tables are actually almost that bad, right now --- compare section 9.1 (hasn't even bothered with a formal ) with tables 9.1, 9.4, 9.9, 9.12, 9.14, 9.30, 9.34, 9.37, 9.41, 9.44. The variation in level of detail and precision is striking, and not in a good way. regards, tom lane
Re: Properly mark NULL returns in numeric aggregates
Hi David, On Mon, Apr 13, 2020 at 10:46 PM David Rowley wrote: > > On Tue, 14 Apr 2020 at 06:14, Tom Lane wrote: > > Yeah, they're relying exactly on the assumption that nodeAgg is not > > going to try to copy a value declared "internal", and therefore they > > can be loosey-goosey about whether the value pointer is null or not. > > However, if you want to claim that that's wrong, you have to explain > > why it's okay for some other code to be accessing a value that's > > declared "internal". I'd say that the meaning of that is precisely > > "keepa u hands off". > > > > In the case at hand, the current situation is that we only expect the > > values returned by these combine functions to be read by the associated > > final functions, which are on board with the null-pointer representation > > of an empty result. Your argument is essentially that it should be > > possible to feed the values to the aggregate's associated serialization > > function as well. But the core code never does that, so I'm not convinced > > that we should add it to the requirements; we'd be unable to test it. > > Casting my mind back to when I originally wrote that code, I attempted > to do so in such a way so that it could one day be used for a 3-stage > aggregation. e.g Parallel Partial Aggregate -> Gather -> Combine > Serial Aggregate on one node, then on some master node a Deserial > Combine Finalize Aggregate. You're very right that we can't craft > such a plan with today's master (We didn't even add a supporting enum > for it in AggSplit). However, it does appear that there are > extensions or forks out there which attempt to use the code in this > way, so it would be good to not leave those people out in the cold > regarding this. Greenplum plans split-aggregation quite similarly from Postgres: while it doesn't pass partial results through a intra-cluster "Gather" -- using a reshuffle-by-hash type operation instead -- Greenplum _does_ split an aggregate into final and partial halves, running them on different nodes. In short, the relation ship among the combine, serial, and deserial functions are similar to how they are in Postgres today (serial->deserial->combine), in the context of splitting aggregates. The current problem arises because Greenplum spills the hash table in hash aggregation (a diff we're working actively to upstream), a process in which we have to touch (read: serialize and copy) the internal trans values. However, we are definitely eyeing what you described as something to move towards. As a fork, we'd like to carry as thin a diff as possible. So the current situation is pretty much forcing us to diverge in the functions mentioned up-thread. In hindsight, "sloppy" might not have been a wise choice of words, apologies for the possible offense, David! > > For testing, can't we just have an Assert() in > advance_transition_function that verifies isnull matches the > nullability of the return value for INTERNAL returning transfns? i.e, > the attached > > I don't have a test case to hand that could cause this to fail, but it > sounds like Jesse might. One easy way to cause this is "sum(x) FILTER (WHERE false)" which will for sure make the partial results NULL. Is that what you're looking for? I'll be happy to send in the SQL. Cheers, Jesse
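For readers following along, a minimal standalone illustration of the invariant David's proposed Assert would enforce (this is not the attached patch; the Datum stand-in and function name are hypothetical):

#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

typedef uintptr_t Datum;        /* stand-in for PostgreSQL's Datum */

/*
 * For a transition function whose result type is INTERNAL (a pass-by-
 * reference pointer), the proposal is that the isnull flag and the
 * pointer-ness of the returned Datum must agree.
 */
static void
check_internal_transvalue(Datum newval, bool isnull)
{
    assert(isnull == ((void *) newval == NULL));
}

int
main(void)
{
    check_internal_transvalue((Datum) 0, true);     /* consistent: passes */
    /* check_internal_transvalue((Datum) 0, false); -- would trip the Assert */
    return 0;
}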
Re: where should I stick that backup?
On Tue, Apr 14, 2020 at 11:08 AM Stephen Frost wrote: > Wouldn't it make sense to, given that we have some idea of what we want > it to eventually look like, to make progress in that direction though? Well, yes. :-) > That is- I tend to agree with Andres that having this supported > server-side eventually is what we should be thinking about as an > end-goal (what is the point of pg_basebackup in all of this, after all, > if the goal is to get a backup of PG from the PG server to s3..? why > go through some other program or through the replication protocol?) and > having the server exec'ing out to run shell script fragments to make > that happen looks like it would be really awkward and full of potential > risks and issues and agreement that it wouldn't be a good fit. I'm fairly deeply uncomfortable with what Andres is proposing. I see that it's very powerful, and can do a lot of things, and that if you're building something that does sophisticated things with storage, you probably want an API like that. It does a great job making complicated things possible. However, I feel that it does a lousy job making simple things simple. Suppose you want to compress using your favorite compression program. Well, you can't. Your favorite compression program doesn't speak the bespoke PostgreSQL protocol required for backup plugins. Neither does your favorite encryption program. Either would be perfectly happy to accept a tarfile on stdin and dump out a compressed or encrypted version, as the case may be, on stdout, but sorry, no such luck. You need a special program that speaks the magic PostgreSQL protocol but otherwise does pretty much the exact same thing as the standard one. It's possibly not the exact same thing. A special might, for example, use multiple threads for parallel compression rather than multiple processes, perhaps gaining a bit of efficiency. But it's doubtful whether all users care about such marginal improvements. All they're going to see is that they can use gzip and maybe lz4 because we provide the necessary special magic tools to integrate with those, but for some reason we don't have a special magic tool that they can use with their own favorite compressor, and so they can't use it. I think people are going to find that fairly unhelpful. Now, it's a problem we can work around. We could have a "shell gateway" program which acts as a plugin, speaks the backup plugin protocol, and internally does fork-and-exec stuff to spin up copies of any binary you want to act as a filter. I don't see any real problem with that. I do think it's very significantly more complicated than just what Andres called an FFI. It's gonna be way easier to just write something that spawns shell processes directly than it is to write something that spawns a process and talks to it using this protocol and passes around file descriptors using the various different mechanisms that different platforms use for that, and then that process turns around and spawns some other processes and passes along the file descriptors to them. Now you've added a whole bunch of platform-specific code and a whole bunch of code to generate and parse protocol messages to achieve exactly the same thing that you could've done far more simply with a C API. Even accepting as a given the need to make the C API work separately on both the client and server side, you've probably at least doubled, and I suspect more like quadrupled, the amount of infrastructure that has to be built. So... 
> If, instead, we worked on a C-based interface which includes filters and > storage drivers, and was implemented through libpgcommon, we could start > with that being all done through pg_basebackup and work to hammer out > the complications and issues that we run into there and, once it seems > reasonably stable and works well, we could potentially pull that into > the backend to be run directly without having to have pg_basebackup > involved in the process. ...let's do this. Actually, I don't really mind if we target something that can work on both the client and server side initially, but based on C, not a new wire protocol with file descriptor passing. That new wire protocol, and the file descriptor passing infrastructure that goes with it, are things that I *really* think should be pushed off to version 2, because I think they're going to generate a lot of additional work and complexity, and I don't want to deal with all of it at once. Also, I don't really see what's wrong with the server forking processes that exec("/usr/bin/lz4") or whatever. We do similar things in other places and, while it won't work for cases where you want to compress a shazillion files, that's not really a problem here anyway. At least at the moment, the server-side format is *always* tar, so the problem of needing a separate subprocess for every file in the data directory does not arise. > There's been good progress in the direction of having more done by the > backend already, and
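For readers trying to picture the "C-based interface which includes filters and storage drivers" idea, here is a purely hypothetical sketch -- none of these type or function names exist in PostgreSQL; they only illustrate how a filter and a target could share one function-pointer interface, with process-spawning left as a per-implementation choice rather than something the framework forces:

#include <stddef.h>

typedef struct BackupSink BackupSink;

typedef struct BackupSinkOps
{
    /* called once per backup, e.g. to open an output file or spawn a helper */
    void        (*begin_backup) (BackupSink *sink, const char *label);
    /* called for each chunk of (possibly already filtered) archive data */
    void        (*write_data) (BackupSink *sink, const char *data, size_t len);
    /* called once at the end: flush, close, reap any child process */
    void        (*end_backup) (BackupSink *sink);
} BackupSinkOps;

struct BackupSink
{
    const BackupSinkOps *ops;   /* behavior of this filter or target */
    BackupSink *next;           /* a filter forwards here; a final target has NULL */
    void       *private_data;   /* compression state, file handle, child pid, ... */
};

/* example: a filter that just logs can stay lightweight -- no extra process */
static void
logging_write(BackupSink *sink, const char *data, size_t len)
{
    /* record that len bytes passed through, then forward them unchanged */
    if (sink->next)
        sink->next->ops->write_data(sink->next, data, len);
}

Under such an interface a shell-command target would simply be one more BackupSinkOps implementation that forks the user's program, rather than a special case wired into the copy loop.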
Re: Properly mark NULL returns in numeric aggregates
David Rowley writes: > For testing, can't we just have an Assert() in > advance_transition_function that verifies isnull matches the > nullability of the return value for INTERNAL returning transfns? i.e, > the attached FTR, I do not like this Assert one bit. nodeAgg.c has NO business inquiring into the contents of internal-type Datums. It has even less business enforcing a particular Datum value for a SQL null --- we have always, throughout the system, considered that if isnull is true then the contents of the Datum are unspecified. I think this is much more likely to cause problems than solve any. regards, tom lane
PG compilation error with Visual Studio 2015/2017/2019
>>I'm still working on testing this patch. If anyone has an idea please suggest. I still see problems with this patch. 1. Variable loct has a redundant initialization; it would be enough to declare it as: _locale_t loct; 2. Style: whitespace in the rc variable declaration. 3. Style: the scope of variable cp_index can be reduced. if (tmp != NULL) { size_t cp_index; cp_index = (size_t)(tmp - winlocname); strncpy(loc_name, winlocname, cp_index); loc_name[cp_index] = '\0'; 4. Memory leak: if _WIN32_WINNT >= 0x0600 is true, _free_locale(loct); is not called. 5. Why call _create_locale at all if _WIN32_WINNT >= 0x0600 is true and loct is not used? regards, Ranier Vilela
Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions
On 2020-04-14 12:10, Dilip Kumar wrote: v14-0001-Immediately-WAL-log-assignments.patch + v14-0002-Issue-individual-invalidations-with.patch + v14-0003-Extend-the-output-plugin-API-with-stream-methods.patch+ v14-0004-Gracefully-handle-concurrent-aborts-of-uncommitt.patch+ v14-0005-Implement-streaming-mode-in-ReorderBuffer.patch + v14-0006-Add-support-for-streaming-to-built-in-replicatio.patch+ v14-0007-Track-statistics-for-streaming.patch + v14-0008-Enable-streaming-for-all-subscription-TAP-tests.patch + v14-0009-Add-TAP-test-for-streaming-vs.-DDL.patch + v14-0010-Bugfix-handling-of-incomplete-toast-tuple.patch applied on top of 8128b0c (a few hours ago) Hi, I haven't followed this thread and maybe this instabilty is known/expected; just thought I'd let you know. When doing running a pgbench run over logical replication (cascading down two replicas), I get this segmentation fault. 2020-04-14 17:27:28.135 CEST [8118] DETAIL: Streaming transactions committing after 0/5FA2A38, reading WAL from 0/5FA2A00. 2020-04-14 17:27:28.135 CEST [8118] LOG: logical decoding found consistent point at 0/5FA2A00 2020-04-14 17:27:28.135 CEST [8118] DETAIL: There are no running transactions. 2020-04-14 17:27:28.138 CEST [8006] LOG: server process (PID 8118) was terminated by signal 11: Segmentation fault 2020-04-14 17:27:28.138 CEST [8006] DETAIL: Failed process was running: COMMIT 2020-04-14 17:27:28.138 CEST [8006] LOG: terminating any other active server processes 2020-04-14 17:27:28.138 CEST [8163] WARNING: terminating connection because of crash of another server process 2020-04-14 17:27:28.138 CEST [8163] DETAIL: The postmaster has commanded this server process to roll back the current transaction and exit, because another server process exited abnormally and possibly corrupted shared memory. 2020-04-14 17:27:28.138 CEST [8163] HINT: In a moment you should be able to reconnect to the database and repeat your command. This error happens somewhat buried away in my test-stuff; I can dig it out and make it into a repeatable test if you need it. (debian stretch/gcc 9.3.0) Erik Rijkers
Re: Poll: are people okay with function/operator table redesign?
On Tue, Apr 14, 2020 at 11:26 AM Tom Lane wrote: > Well, no it isn't. The main nit I would pick in that claim is that > it's far from obvious that the three examples of float8 * interval > are all talking about the same operator; in fact, a reader would > be very likely to draw the false conclusion that there is an > integer * interval operator. I agree that's not great. I think that could possibly be fixed by showing all three examples in the same cell, and maybe by revising the choice of examples. > I'd like to have more consistency, not less. I do not think it helps > readers to have each page in Chapter 9 have its own idiosyncratic way of > presenting operators/functions. The operator tables are actually almost > that bad, right now --- compare section 9.1 (hasn't even bothered with > a formal ) with tables 9.1, 9.4, 9.9, 9.12, 9.14, 9.30, 9.34, > 9.37, 9.41, 9.44. The variation in level of detail and precision is > striking, and not in a good way. Well, I don't know. Having two or even three formats is not the same as having infinitely many formats, and may be justified if the needs are sufficiently different from each other. At any rate, if the price of more clarity and more examples is that the tables become three times as long and harder to read, I am somewhat inclined to think that the cure is worse than the disease. I can readily see how something like table 9.10 (Other String Functions) might be a mess on a narrow screen or in PDF format, but it's an extremely useful table on a normal-size screen in HTML format, and part of what makes it useful is that it's compact. Almost anything we do is going to remove some of that compactness to save horizontal space. Maybe that's OK, but it's sure not great. It's nice to be able to see more on one screen. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: cleaning perl code
On 4/13/20 12:47 PM, Andrew Dunstan wrote: > > OK, I've committed all that stuff. I think that takes care of the > non-controversial part of what I proposed :-) > > OK, it seems there is a majority of people commenting in this thread in favor of not doing more except to reverse the policy of requiring subroutine returns. I'll do that shortly. In the spirit of David Steele's contribution, here is a snippet that when added to the perlcriticrc would allow us to pass at the "brutal" setting (severity 1). But I'm not proposing to add this, it's just here so anyone interested can see what's involved. One of the things that's a bit sad is that perlcritic doesn't generally let you apply policies to a given set of files or files matching some pattern. It would be nice, for instance, to be able to apply some additional standards to strategic library files like PostgresNode.pm, TestLib.pm and Catalog.pm. There are good reasons as suggested upthread to apply higher standards to library files than to, say, a TAP test script. The only easy way I can see to do that would be to have two different perlcriticrc files and adjust pgperlcritic to make two runs. If people think that's worth it I'll put a little work into it. If not, I'll just leave things here. cheers andrew -- Andrew Dunstanhttps://www.2ndQuadrant.com PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services # severity 4 [-BuiltinFunctions::RequireBlockGrep] # 9 occurrences [-BuiltinFunctions::RequireBlockMap] # 1 occurrences [-InputOutput::ProhibitReadlineInForLoop] # 2 occurrences [-InputOutput::RequireBriefOpen] # 83 occurrences [-Modules::ProhibitAutomaticExportation] # 7 occurrences [-Modules::ProhibitMultiplePackages] # 9 occurrences [-Objects::ProhibitIndirectSyntax] # 12 occurrences [-Subroutines::RequireArgUnpacking] # 39 occurrences [-TestingAndDebugging::ProhibitNoWarnings] # 6 occurrences [-TestingAndDebugging::ProhibitProlongedStrictureOverride] # 2 occurrences [-ValuesAndExpressions::ProhibitCommaSeparatedStatements] # 4 occurrences [-ValuesAndExpressions::ProhibitConstantPragma] # 2 occurrences [-ValuesAndExpressions::ProhibitMixedBooleanOperators] # 2 occurrences [-Variables::RequireLocalizedPunctuationVars] # 72 occurrences # severity 3 [-BuiltinFunctions::ProhibitComplexMappings] # 2 occurrences [-BuiltinFunctions::ProhibitLvalueSubstr] # 1 occurrences [-BuiltinFunctions::ProhibitVoidMap] # 1 occurrences [-BuiltinFunctions::RequireSimpleSortBlock] # 1 occurrences [-ClassHierarchies::ProhibitExplicitISA] # 10 occurrences [-CodeLayout::ProhibitHardTabs] # 172 occurrences [-ControlStructures::ProhibitCascadingIfElse] # 15 occurrences [-ControlStructures::ProhibitDeepNests] # 1 occurrences [-ControlStructures::ProhibitNegativeExpressionsInUnlessAndUntilConditions] # 5 occurrences [-ErrorHandling::RequireCarping] # 284 occurrences [-ErrorHandling::RequireCheckingReturnValueOfEval] # 11 occurrences [-InputOutput::ProhibitBacktickOperators] # 13 occurrences [-InputOutput::ProhibitJoinedReadline] # 1 occurrences [-InputOutput::RequireCheckedOpen] # 24 occurrences [-Miscellanea::ProhibitUnrestrictedNoCritic] # 12 occurrences [-Modules::ProhibitConditionalUseStatements] # 1 occurrences [-Modules::ProhibitExcessMainComplexity] # 15 occurrences [-NamingConventions::ProhibitAmbiguousNames] # 3 occurrences [-RegularExpressions::ProhibitCaptureWithoutTest] # 30 occurrences [-RegularExpressions::ProhibitComplexRegexes] # 267 occurrences [-RegularExpressions::ProhibitUnusedCapture] # 11 occurrences 
[-RegularExpressions::RequireExtendedFormatting] # 1048 occurrences [-Subroutines::ProhibitExcessComplexity] # 13 occurrences [-Subroutines::ProhibitManyArgs] # 9 occurrences [-Subroutines::ProhibitUnusedPrivateSubroutines] # 3 occurrences [-TestingAndDebugging::RequireTestLabels] # 4 occurrences [-ValuesAndExpressions::ProhibitImplicitNewlines] # 312 occurrences [-ValuesAndExpressions::ProhibitMismatchedOperators] # 11 occurrences [-ValuesAndExpressions::ProhibitQuotesAsQuotelikeOperatorDelimiters] # 2 occurrences [-ValuesAndExpressions::ProhibitVersionStrings] # 1 occurrences [-ValuesAndExpressions::RequireQuotedHeredocTerminator] # 98 occurrences [-Variables::ProhibitPackageVars] # 79 occurrences [-Variables::ProhibitReusedNames] # 14 occurrences [-Variables::ProhibitUnusedVariables] # 8 occurrences [-Variables::RequireInitializationForLocalVars] # 3 occurrences # severity 2 [-BuiltinFunctions::ProhibitBooleanGrep] # 14 occurrences [-BuiltinFunctions::ProhibitStringySplit] # 8 occurrences [-BuiltinFunctions::ProhibitUselessTopic] # 4 occurrences [-CodeLayout::ProhibitQuotedWordLists] # 13 occurrences [-ControlStructures::ProhibitCStyleForLoops] # 26 occurrences [-ControlStructures::ProhibitPostfixControls] # 325 occurrences [-ControlStructures::ProhibitUnlessBlocks] # 12 occurrences [-Documentation::RequirePodSections] # 51 occurrences [-InputOutput::RequireCheckedClose] # 140 occurrences [-Miscellanea::ProhibitTies] #
Re: Poll: are people okay with function/operator table redesign?
Robert Haas writes: > At any rate, if the price of more clarity and more examples is that > the tables become three times as long and harder to read, I am > somewhat inclined to think that the cure is worse than the disease. I > can readily see how something like table 9.10 (Other String Functions) > might be a mess on a narrow screen or in PDF format, but it's an > extremely useful table on a normal-size screen in HTML format, and > part of what makes it useful is that it's compact. Almost anything we > do is going to remove some of that compactness to save horizontal > space. Maybe that's OK, but it's sure not great. It's nice to be able > to see more on one screen. I dunno, it doesn't look to me like 9.10 is some paragon of efficient use of screen space, even with a wide window. (And my goodness it looks bad if I try a window about half my usual web-browsing width.) Maybe I should go convert that one to see what it looks like in one of the other layouts being discussed. regards, tom lane
Re: [BUG] non archived WAL removed during production crash recovery
On Fri, 10 Apr 2020 11:00:31 +0900 (JST) Kyotaro Horiguchi wrote: [...] > > > but the mistake here is it thinks that inRecovery represents whether it is > > > running as a standby or not, but actually it is true on primary during > > > crash recovery. > > > > Indeed. > > > > > On the other hand, with the patch, standby with archive_mode=on > > > wrongly archives WAL files during crash recovery. > > > > "without the patch" you mean? You are talking about 78ea8b5daab, right? > > No. I menat the v4 patch in [1]. > > [1] https://www.postgresql.org/message-id/20200407171736.61906608%40firost > > Prior to appllying the patch (that is the commit 78ea..), > XLogArchiveCheckDone() correctly returns true (= allow to remove it) > for the same condition. > > The proposed patch does the folloing thing. > > if (!XLogArchivingActive() || > recoveryState == ARCHIVE_RECOVERING && !XLogArchivingAlways()) > return true; > > It doesn't return for the condition "recoveryState=CRASH_RECOVERING > and archive_mode = on". Then the WAL files is mitakenly marked as > ".ready" if not marked yet. Indeed. But .ready files are then deleted during the first restartpoint. I'm not sure how to fix this behavior without making the code too complex. This is discussed in my last answer to Michael few minutes ago as well. > By the way, the code seems not following the convention a bit > here. Let the inserting code be in the same style to the existing code > around. > > + if ( ! XLogArchivingActive() ) > > I think we don't put the spaces within the parentheses above. Indeed, This is fixed in patch v5 sent few minutes ago. > | ARCHIVE_RECOVERING/CRASH_RECOVERING/NOT_IN_RECOVERY > > The first two and the last one are in different style. *I* prefer them > (if we use it) be "IN_ARCHIVE_RECOVERY/IN_CRASH_RECOVERY/NOT_IN_RECOVERY". I like Michael's proposal. See v5 of the patch. Thank you for your review! Regards,
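For clarity, here is a standalone sketch of the decision being debated (hypothetical enum and function names, and a reading of the v4 patch's logic rather than any committed code):

#include <stdbool.h>

typedef enum RecoveryState
{
    NOT_IN_RECOVERY,
    IN_CRASH_RECOVERY,
    IN_ARCHIVE_RECOVERY
} RecoveryState;

typedef enum ArchiveMode
{
    ARCHIVE_MODE_OFF,
    ARCHIVE_MODE_ON,            /* archive on a primary only */
    ARCHIVE_MODE_ALWAYS         /* archive on primary and standby */
} ArchiveMode;

/*
 * Returns true when a completed WAL segment needs no .ready marker and can
 * simply be recycled or removed.
 */
static bool
segment_needs_no_archiving(RecoveryState state, ArchiveMode mode)
{
    if (mode == ARCHIVE_MODE_OFF)
        return true;            /* archiving disabled entirely */

    if (state == IN_ARCHIVE_RECOVERY && mode != ARCHIVE_MODE_ALWAYS)
        return true;            /* a standby archives only with "always" */

    /*
     * The remaining case under discussion: IN_CRASH_RECOVERY with
     * archive_mode=on falls through to "keep it", so segments replayed
     * during crash recovery can get marked .ready; as noted above, those
     * markers are then cleaned up at the first restartpoint.
     */
    return false;
}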
ERROR: could not determine which collation to use for string comparison
Guys; This errors out with: ERROR: could not determine which collation to use for string comparison HINT: Use the COLLATE clause to set the collation explicitly. The database is init'ed with: initdb -D $PGDATA -E utf8 --locale=nb_NO.UTF-8 13-dev HEAD as of 8128b0c152a67917535f50738ac26da4f984ddd9 Works fine in <= 12 === create table person( id serial primary key, firstname varchar, lastname varchar ); insert into person(firstname, lastname) values ('Andreas', 'Krogh'); CREATE OR REPLACE FUNCTION concat_lower(varchar, varchar) RETURNS varchar AS $$ SELECT nullif(lower(coalesce($1, '')) || lower(coalesce($2, '')), '') $$ LANGUAGE SQL IMMUTABLE; select * from person pers ORDER BY concat_lower(pers.firstname, pers.lastname) ASC; === -- Andreas Joseph Krogh
Re: documenting the backup manifest file format
On Mon, Apr 13, 2020 at 5:43 PM Alvaro Herrera wrote: > Yeah, I guess I'm just saying that it feels brittle to have a file > format that's supposed to be good for data exchange and then make it > itself depend on representation details such as the order that fields > appear in, the letter case, or the format of newlines. Maybe this isn't > really of concern, but it seemed strange. I didn't want to use JSON for this at all, but I got outvoted. When I raised this issue, it was suggested that I deal with it in this way, so I did. I can't really defend it too far beyond that, although I do think that one nice thing about this is that you can verify the checksum using shell commands if you want. Just figure out the number of lines in the file, minus one, and do head -n$LINES backup_manifest | shasum -a256 and boom. If there were some whitespace-skipping thing figuring out how to reproduce the checksum calculation would be hard. > I think strict ISO 8601 might be preferable (with the T in the middle > and ending in Z instead of " GMT"). Hmm, did David suggest that before? I don't recall for sure. I think he had some suggestion, but I'm not sure if it was the same one. > > > Why is the top-level checksum only allowed to be SHA-256, if the > > > files can use up to SHA-512? > > Thanks for the discussion. I think you mostly want to make sure that > the manifest is sensible (not corrupt) rather than defend against > somebody maliciously giving you an attacking manifest (??). I incline > to agree that any SHA-2 hash is going to serve that purpose and have no > further comment to make. The code has other sanity checks against the manifest failing to parse properly, so you can't (I hope) crash it or anything even if you falsify the checksum. But suppose that there is a gremlin running around your system flipping occasional bits. If said gremlin flips a bit in a "0" that appears in a file's checksum string, it could become a "1", a "3", or a "7", all of which are still valid characters for a hex string. When you then tried to verify the backup, verification for that file would fail, but you'd think it was a problem with the file, rather than a problem with the manifest. The manifest checksum prevents that: you'll get a complaint about the manifest checksum being wrong rather than a complaint about the file not matching the manifest checksum. A sufficiently smart gremlin could figure out the expected checksum for the revised manifest and flip bits to make the actual value match the expected one, but I think we're worried about "chaotic neutral" gremlins, not "lawful evil" ones. That having been said, there was some discussion on the original thread about keeping your backup on regular storage and your manifest checksum in a concrete bunker at the bottom of the ocean; in that scenario, it should be possible to detect tampering in either the manifest itself or in non-WAL data files, as long as the adversary can't break SHA-256. But I'm not sure how much we should really worry about that. For me, the design center for this feature is a user who untars base.tar and forgets about 43965.tar. If that person runs pg_verifybackup, it's gonna tell them that things are broken, and that's good enough for me. It may not be good enough for everybody, but it's good enough for me. I think I'm going to go ahed and push this now, maybe with a small wording tweak as discussed upthread with Andrew. 
The rest of this discussion is really about whether the patch needs any design changes rather than about whether the documentation describes what the patch does, so it makes sense to me to commit this first and then if somebody wants to argue for a change they certainly can. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: documenting the backup manifest file format
On 4/14/20 12:56 PM, Robert Haas wrote: On Mon, Apr 13, 2020 at 5:43 PM Alvaro Herrera wrote: Yeah, I guess I'm just saying that it feels brittle to have a file format that's supposed to be good for data exchange and then make it itself depend on representation details such as the order that fields appear in, the letter case, or the format of newlines. Maybe this isn't really of concern, but it seemed strange. I didn't want to use JSON for this at all, but I got outvoted. When I raised this issue, it was suggested that I deal with it in this way, so I did. I can't really defend it too far beyond that, although I do think that one nice thing about this is that you can verify the checksum using shell commands if you want. Just figure out the number of lines in the file, minus one, and do head -n$LINES backup_manifest | shasum -a256 and boom. If there were some whitespace-skipping thing figuring out how to reproduce the checksum calculation would be hard. I think strict ISO 8601 might be preferable (with the T in the middle and ending in Z instead of " GMT"). Hmm, did David suggest that before? I don't recall for sure. I think he had some suggestion, but I'm not sure if it was the same one. "I'm also partial to using epoch time in the manifest because it is generally easier for programs to work with. But, human-readable doesn't suck, either." Also you don't need to worry about time-zone conversion errors -- even if the source time is UTC this can easily happen if you are not careful. It also saves a parsing step. The downside is it is not human-readable but this is intended to be a machine-readable format so I don't think it's a big deal (encoded filenames will be just as opaque). If a user really needs to know what time some file is (rare, I think) they can paste it with a web tool to find out. Regards, -- -David da...@pgmasters.net
Re: sqlsmith crash incremental sort
On Sun, Apr 12, 2020 at 8:09 PM Tomas Vondra wrote: > > On Sun, Apr 12, 2020 at 12:44:45AM +0200, Tomas Vondra wrote: > >Hi, > > > >I've looked into this a bit, and at first I thought that maybe the > >issue is in how cost_incremental_sort picks the EC members. It simply > >does this: > > > >EquivalenceMember *member = (EquivalenceMember *) > >linitial(key->pk_eclass->ec_members); > > > >so I was speculating that maybe there are multiple EC members and the > >one we need is not the first one. That would have been easy to fix. > > > >But that doesn't seem to be the case - in this example the EC ony has a > >single EC member anyway. > > > >(gdb) p key->pk_eclass->ec_members > >$14 = (List *) 0x12eb958 > >(gdb) p *key->pk_eclass->ec_members > >$15 = {type = T_List, length = 1, max_length = 5, elements = 0x12eb970, > > initial_elements = 0x12eb970} > > > >and the member is a Var with varno=0 (with a RelabelType on top, but > >that's irrelevant). > > > >(gdb) p *(Var*)((RelabelType*)member->em_expr)->arg > >$12 = {xpr = {type = T_Var}, varno = 0, varattno = 1, vartype = 12445, > > vartypmod = -1, varcollid = 950, varlevelsup = 0, varnosyn = 0, varattnosyn > > = 1, location = -1} > > > >which then triggers the assert in find_base_rel. When looking for other > >places calling estimate_num_groups I found this in prepunion.c: > > > > * XXX you don't really want to know about this: we do the estimation > > * using the subquery's original targetlist expressions, not the > > * subroot->processed_tlist which might seem more appropriate. The > > * reason is that if the subquery is itself a setop, it may return a > > * processed_tlist containing "varno 0" Vars generated by > > * generate_append_tlist, and those would confuse estimate_num_groups > > * mightily. We ought to get rid of the "varno 0" hack, but that > > * requires a redesign of the parsetree representation of setops, so > > * that there can be an RTE corresponding to each setop's output. > > > >which seems pretty similar to the issue at hand, because the subpath is > >T_UpperUniquePath (not sure if that passes as setop, but the symptoms > >match nicely). > > > >Not sure what to do about it in cost_incremental_sort, though :-( > > > > I've been messing with this the whole day, without much progress :-( > > I'm 99.% sure it's the same issue described by the quoted comment, > because the plan looks like this: > > Nested Loop Left Join > -> Sample Scan on pg_namespace > Sampling: system ('7.2'::real) > -> Incremental Sort > Sort Key: ... > Presorted Key: ... > -> Unique > -> Sort > Sort Key: ... > -> Append > -> Nested Loop > ... > -> Nested Loop > ... > > so yeah, the plan does have set operations, and generate_append_tlist > does generate Vars with varno == 0, causing this issue. This is a bit of an oddly shaped plan anyway, right? In an ideal world the sort for the unique would have knowledge about what would be useful for the parent node, and we wouldn't need the incremental sort at all. I'm not sure that that kind of thing is really a new problem, though, and it might not even be entirely possible to fix directly by trying to push down knowledge about useful sort keys to whatever created that sort path; it might only be fixable by having the incremental sort (or even regular sort) path creation know to "subsume" a sort underneath it. Anyway, I think that's a bit off topic, but it stood out to me. > But I'm not entirely sure what to do about it in cost_incremental_sort. 
> The comment (introduced by 89deca582a in 2017) suggests a proper fix > would require redesigning the parsetree representation of setops, and > it's a bit too late for that. > > So I wonder what a possible solution might look like. I was hoping we > might grab the original target list and use that, similarly to > recurse_set_operations, but I'm not sure how/where to get it. This is also not an area I'm familiar with. Reading through the prepunion.c code alongside cost_incremental_sort, it seems that we don't have access to the same level of information as the prepunion code (i.e., we're only looking at the result of the union, not the components of it), and trying descend down into it seems even more gross, so, see below... > Another option is to use something as simple as checking for Vars with > varno==0 in cost_incremental_sort() and ignoring them somehow. We could > simply use some arbitrary estimate - by assuming the rows are unique or > something like that. Yes, I agree it's pretty ugly and I'd much rather > find a way to generate something sensible, but I'm not even sure we can > generate good estimate when doing UNION of data from different relations > and so on. Th
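A minimal sketch of the "check for Vars with varno==0 and ignore them" idea quoted above (a hypothetical helper written in the usual expression-walker idiom, not a proposed patch; because the walker recurses it also sees Vars hidden under a RelabelType):

#include "postgres.h"

#include "nodes/nodeFuncs.h"
#include "nodes/primnodes.h"

/*
 * Return true if the expression references the dummy varno 0 that
 * generate_append_tlist() produces for setop output columns.  A caller
 * such as cost_incremental_sort() could skip estimate_num_groups() for
 * such a pathkey and fall back to a default group-size estimate (e.g.
 * treat the presorted keys as unique) instead of reaching find_base_rel()
 * with varno 0.
 */
static bool
contains_varno_zero_walker(Node *node, void *context)
{
    if (node == NULL)
        return false;
    if (IsA(node, Var) && ((Var *) node)->varno == 0)
        return true;
    return expression_tree_walker(node, contains_varno_zero_walker, context);
}

static bool
contains_varno_zero(Node *expr)
{
    return contains_varno_zero_walker(expr, NULL);
}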
Re: documenting the backup manifest file format
On 2020-Apr-14, David Steele wrote: > On 4/14/20 12:56 PM, Robert Haas wrote: > > > Hmm, did David suggest that before? I don't recall for sure. I think > > he had some suggestion, but I'm not sure if it was the same one. > > "I'm also partial to using epoch time in the manifest because it is > generally easier for programs to work with. But, human-readable doesn't > suck, either." Ugh. If you go down that road, why write human-readable contents at all? You may as well just use a binary format. But that's a very slippery slope and you won't like to be in the bottom -- I don't see what that gains you. It's not like it's a lot of work to parse a timestamp in a non-internationalized well-defined human-readable format. -- Álvaro Herrerahttps://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Re: documenting the backup manifest file format
On 4/14/20 1:27 PM, Alvaro Herrera wrote: On 2020-Apr-14, David Steele wrote: On 4/14/20 12:56 PM, Robert Haas wrote: Hmm, did David suggest that before? I don't recall for sure. I think he had some suggestion, but I'm not sure if it was the same one. "I'm also partial to using epoch time in the manifest because it is generally easier for programs to work with. But, human-readable doesn't suck, either." Ugh. If you go down that road, why write human-readable contents at all? You may as well just use a binary format. But that's a very slippery slope and you won't like to be in the bottom -- I don't see what that gains you. It's not like it's a lot of work to parse a timestamp in a non-internationalized well-defined human-readable format. Well, times are a special case because they are so easy to mess up. Try converting ISO-8601 to epoch time using the standard C functions on a system where TZ != UTC. Fun times. Regards, -- -David da...@pgmasters.net
Re: index paths and enable_indexscan
On Tue, Apr 14, 2020 at 10:21 AM Tom Lane wrote: > Awhile back I'd looked into getting rid of disable_cost altogether > by dint of not generating disabled paths. It's harder than it > sounds. We could perhaps change this particular case, but it's > not clear that there's any real benefit of making this one change > in isolation. I like the idea and have had the same thought before. I wondered whether we could arrange to generate paths for a rel and then if we end up with no paths, do it again ignoring the disable flags. It didn't seem entirely easy to rearrange things to work that way, though. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: Do we need to handle orphaned prepared transactions in the server?
Thank you so much Bruce and Sawada. Bruce: I'll update the documentation with more details for max_age_prepared_xacts Sawada: You have raised 4 very valid points. Here are my thoughts. (1) I think your concern is that if we can reduce the need of 2 GUCs to one. The purpose of having 2 different GUCs was serving two different purposes. - max_age_prepared_xacts; this defines the maximum age of a prepared transaction after which it may be considered an orphan. - prepared_xacts_vacuum_warn_timeout; since we are throwing warnings in the log, we need a way of controlling the behaviour to prevent from flooding the log file with our messages. This timeout defines that. May be there is a better way of handling this, but may be not. (2) Your point is that when there are more than one prepared transactions (and even if the first one is removed), timeout prepared_xacts_vacuum_warn_timeout isn't always accurate. Yes, I agree. However, for us to hit the exact timeout for each prepared transaction, we need setup timers and callback functions. That's too much of an overhead IMHO. The alternative design that I took (the current design) is based on the assumption that we don't need a precise timeout for these transactions or for vacuum to report these issues to log. So, a decent enough way of setting a timeout should be good enough considering that it doesn't add any real overhead to the vacuum process. This does mean that an orphaned prepared transaction may be notified after prepared_xacts_vacuum_warn_timeout * 2. This, IMHO should be acceptable behavior. (3) Message is too detailed. Yes, I agree. I'll review this an update the patch. (4) GUCs should be renamed. Yes, I agree. The names you have suggested make more sense. I'll send an updated version of the patch with the new names and other suggested changes. On Wed, Mar 11, 2020 at 10:43 AM Masahiko Sawada < masahiko.saw...@2ndquadrant.com> wrote: > On Mon, 2 Mar 2020 at 21:42, Hamid Akhtar wrote: > > > > Here is the v2 of the same patch after rebasing it and running it > through pgindent. There are no other code changes. > > > > Thank you for working on this. I think what this patch tries to > achieve would be helpful to inform orphaned prepared transactions that > can be cause of inefficient vacuum to users. > > As far as I read the patch, the setting this feature using newly added > parameters seems to be complicated to me. IIUC, even if a prepared > transactions is enough older than max_age_prepared_xacts, we don't > warn if it doesn't elapsed prepared_xacts_vacuum_warn_timeout since > when the "first" prepared transaction is created. And the first > prepared transaction means that the first entry for > TwoPhaseStateData->prepXacts. Therefore, if there is always more than > one prepared transaction, we don't warn for at least > prepared_xacts_vacuum_warn_timeout seconds even if the first added > prepared transaction is already removed. So I'm not sure how we can > think the setting value of prepared_xacts_vacuum_warn_timeout. > > Regarding the warning message, I wonder if the current message is too > detailed. If we want to inform that there is orphaned prepared > transactions to users, it seems to me that it's enough to report the > existence (and possibly the number of orphaned prepared transactions), > rather than individual details. 
> > Given that the above things, we can simply think this feature; we can > have only max_age_prepared_xacts, and autovacuum checks the minimum of > prepared_at of prepared transactions, and compare it to > max_age_prepared_xacts. We can warn if (CurrentTimestamp - > min(prepared_at)) > max_age_prepared_xacts. In addition, if we also > want to control this behavior by the age of xid, we can have another > GUC parameter for comparing the age(min(xid of prepared transactions)) > to that value. > > Finally, regarding the name of parameters, when we mention the age of > transaction it means the age of xid of the transaction, not the time. > Please refer to other GUC parameter having "age" in its name such as > autovacuum_freeze_max_age, vacuum_freeze_min_age. The patch adds > max_age_prepared_xacts but I think it should be renamed. For example, > prepared_xact_warn_min_duration is for time and > prepared_xact_warn_max_age for age. > > Regards, > > -- > Masahiko Sawadahttp://www.2ndQuadrant.com/ > PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services > -- Highgo Software (Canada/China/Pakistan) URL : www.highgo.ca ADDR: 10318 WHALLEY BLVD, Surrey, BC CELL:+923335449950 EMAIL: mailto:hamid.akh...@highgo.ca SKYPE: engineeredvirus
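For what it's worth, the simplified rule sketched in the quoted mail, i.e. warn when (CurrentTimestamp - min(prepared_at)) exceeds the threshold, is only a few lines; a rough illustration follows (the function, its parameters, and the message wording are hypothetical, not taken from the patch):

#include "postgres.h"
#include "utils/timestamp.h"

/*
 * Emit a warning if the oldest prepared transaction has been around
 * longer than the configured threshold (in seconds).  A caller such as
 * autovacuum would pass min(prepared_at) across the prepared-transaction
 * list plus the number of transactions exceeding the threshold.
 */
static void
warn_old_prepared_xacts(TimestampTz oldest_prepared_at, int n_old,
						int warn_after_secs)
{
	if (n_old > 0 &&
		TimestampDifferenceExceeds(oldest_prepared_at,
								   GetCurrentTimestamp(),
								   warn_after_secs * 1000))
		ereport(WARNING,
				(errmsg("%d prepared transactions are older than %d seconds",
						n_old, warn_after_secs)));
}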
Re: documenting the backup manifest file format
On 4/14/20 1:33 PM, David Steele wrote: > On 4/14/20 1:27 PM, Alvaro Herrera wrote: >> On 2020-Apr-14, David Steele wrote: >> >>> On 4/14/20 12:56 PM, Robert Haas wrote: >>> Hmm, did David suggest that before? I don't recall for sure. I think he had some suggestion, but I'm not sure if it was the same one. >>> >>> "I'm also partial to using epoch time in the manifest because it is >>> generally easier for programs to work with. But, human-readable >>> doesn't >>> suck, either." >> >> Ugh. If you go down that road, why write human-readable contents at >> all? You may as well just use a binary format. But that's a very >> slippery slope and you won't like to be in the bottom -- I don't see >> what that gains you. It's not like it's a lot of work to parse a >> timestamp in a non-internationalized well-defined human-readable format. > > Well, times are a special case because they are so easy to mess up. > Try converting ISO-8601 to epoch time using the standard C functions > on a system where TZ != UTC. Fun times. > > Even if it's a zulu time? That would be pretty damn sad. cheers andrew -- Andrew Dunstanhttps://www.2ndQuadrant.com PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Re: Corruption during WAL replay
Thanks Kyotaro and Masahiko for the feedback. I think there is a consensus on the critical-section around truncate, but I just want to emphasize the need for reversing the order of the dropping the buffers and the truncation. Repro details (when full page write = off) 1) Page on disk has empty LP 1, Insert into page LP 1 2) checkpoint START (Recovery REDO eventually starts here) 3) Delete all rows on the page (page is empty now) 4) Autovacuum kicks in and truncates the pages DropRelFileNodeBuffers - Dirty page NOT written, LP 1 on disk still empty 5) Checkpoint completes 6) Crash 7) smgrtruncate - Not reached (this is where we do the physical truncate) Now the crash-recovery starts Delete-log-replay (above step-3) reads page with empty LP 1 and the delete fails with PANIC (old page on disk with no insert) Doing recovery, truncate is even not reached, a WAL replay of the truncation will happen in the future but the recovery fails (repeatedly) even before reaching that point. Best regards, Teja From: Kyotaro Horiguchi Sent: Monday, April 13, 2020 7:35 PM To: masahiko.saw...@2ndquadrant.com Cc: and...@anarazel.de ; tejesw...@hotmail.com ; pgsql-hack...@postgresql.org ; hexexp...@comcast.net Subject: Re: Corruption during WAL replay At Mon, 13 Apr 2020 18:53:26 +0900, Masahiko Sawada wrote in > On Mon, 13 Apr 2020 at 17:40, Andres Freund wrote: > > > > Hi, > > > > On 2020-04-13 15:24:55 +0900, Masahiko Sawada wrote: > > > On Sat, 11 Apr 2020 at 09:00, Teja Mupparti wrote: > > > /* > > > * We WAL-log the truncation before actually truncating, which means > > > * trouble if the truncation fails. If we then crash, the WAL replay > > > * likely isn't going to succeed in the truncation either, and cause a > > > * PANIC. It's tempting to put a critical section here, but that cure > > > * would be worse than the disease. It would turn a usually harmless > > > * failure to truncate, that might spell trouble at WAL replay, into a > > > * certain PANIC. > > > */ > > > > Yea, but that reasoning is just plain *wrong*. It's *never* ok to WAL > > log something and then not perform the action. This leads to to primary > > / replica getting out of sync, crash recovery potentially not completing > > (because of records referencing the should-be-truncated pages), ... It is introduced in 2008 by 3396000684, for 8.4. So it can be said as an overlook when introducing log-shipping. The reason other operations like INSERTs (that extends the underlying file) are "safe" after an extension failure is the following operations are performed in shared buffers as if the new page exists, then tries to extend the file again. So if we continue working after truncation failure, we need to disguise on shared buffers as if the truncated pages are gone. But we don't have a room for another flag in buffer header. For example, BM_DIRTY && !BM_VALID might be able to be used as the state that the page should have been truncated but not succeeded yet, but I'm not sure. Anyway, I think the prognosis of a truncation failure is far hopeless than extension failure in most cases and I doubt that it's good to introduce such a complex feature only to overcome such a hopeless situation. In short, I think we should PANIC in that case. > > > As a second idea, I wonder if we can defer truncation until commit > > > time like smgrDoPendingDeletes mechanism. The sequence would be: > > > > This is mostly an issue during [auto]vacuum partially truncating the end > > of the file. 
We intentionally release the AEL regularly to allow other > accesses to continue. > > For transactional truncations we don't go down this path (as we create a > new relfilenode). > > > > > At RelationTruncate(), > > > 1. WAL logging. > > > 2. Remember buffers to be dropped. > > > > You definitely cannot do that, as explained above. > > Ah yes, you're right. > > So it seems to me currently what we can do for this issue would be to > enclose the truncation operation in a critical section. IIUC it's not > enough just to reverse the order of dropping buffers and physical file > truncation because it cannot solve the problem of inconsistency on the > standby. And as Horiguchi-san mentioned, there is no need to reverse > that order if we envelop the truncation operation by a critical > section because we can recover page changes during crash recovery. The > strategy of writing out all dirty buffers before dropping buffers, > proposed as (a) in [1], also seems not enough. Agreed. Since neither WAL-logging->not-performing nor performing->WAL-logging is acceptable, there's no other way than working as if the truncation succeeded (and trying again) even if it actually failed. But that would be too complex. Just making it a critical section seems the right thing here. > [1] > https://www.postgresql.org/message-id/201912070012
Re: pg_basebackup, manifests and backends older than ~12
On Mon, Apr 13, 2020 at 8:23 PM Michael Paquier wrote: > On Mon, Apr 13, 2020 at 07:55:07PM -0400, Robert Haas wrote: > > Oh, hmm. Maybe I'm getting confused with a previous version of the > > patch that behaved differently. > > No problem. If you prefer keeping this part of the code, that's fine > by me. If you think that the patch is suited as-is, including > silencing the error forcing to use --no-manifest on server versions > older than v13, I am fine to help out and apply it myself, but I am > also fine if you wish to take care of it by yourself. Feel free to go ahead. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: Do we need to handle orphaned prepared transactions in the server?
On Wed, Feb 19, 2020 at 10:05 AM Hamid Akhtar wrote: > Attached is version 1 of POC patch for notifying of orphaned > prepared transactions via warnings emitted to a client > application and/or log file. It applies to PostgreSQL branch > "master" on top of "e2e02191" commit. I think this is a bad idea and that we should reject the patch. It's true that forgotten prepared transactions are a problem, but it's also true that you can monitor for that yourself using the pg_prepared_xacts view. If you do, you will have a lot more flexibility than this patch gives you, or than any similar patch ever can give you. Generally, people don't pay attention to warnings in logs, so they're just clutter. Moreover, there are tons of different things for which you should probably monitor (wraparound perils, slow checkpoints, bloated tables, etc.) and so the real solution is to run some monitoring software. So even if you do pay attention to your logs, and even if the GUCs this provides give you sufficient flexibility for your needs in this one area, you still need to run some monitoring software. At which point, you don't also need this. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: documenting the backup manifest file format
On 4/14/20 3:03 PM, Andrew Dunstan wrote: On 4/14/20 1:33 PM, David Steele wrote: On 4/14/20 1:27 PM, Alvaro Herrera wrote: On 2020-Apr-14, David Steele wrote: On 4/14/20 12:56 PM, Robert Haas wrote: Hmm, did David suggest that before? I don't recall for sure. I think he had some suggestion, but I'm not sure if it was the same one. "I'm also partial to using epoch time in the manifest because it is generally easier for programs to work with. But, human-readable doesn't suck, either." Ugh. If you go down that road, why write human-readable contents at all? You may as well just use a binary format. But that's a very slippery slope and you won't like to be in the bottom -- I don't see what that gains you. It's not like it's a lot of work to parse a timestamp in a non-internationalized well-defined human-readable format. Well, times are a special case because they are so easy to mess up. Try converting ISO-8601 to epoch time using the standard C functions on a system where TZ != UTC. Fun times. Even if it's a zulu time? That would be pretty damn sad. ZULU/GMT/UTC are all fine. But if the server timezone is EDT for example (not that I recommend this) you are likely to get the wrong result. Results vary based on your platform. For instance, we found MacOS was more likely to work the way you would expect and Linux was hopeless. There are all kinds of fun tricks to get around this (sort of). One is to temporarily set TZ=UTC which sucks if an error happens before it gets set back. There are some hacks to try to determine your offset which have inherent race conditions around DST changes. After some experimentation we just used the Posix definition for epoch time and used that to do our conversions: https://pubs.opengroup.org/onlinepubs/9699919799/basedefs/V1_chap04.html#tag_04_16 Regards, -- -David da...@pgmasters.net
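For reference, the POSIX definition linked above boils down to plain arithmetic, so the conversion can be done without ever consulting TZ; a minimal sketch (not pgBackRest's actual code) of turning a broken-down UTC time into epoch seconds:

#include <time.h>

/*
 * Convert a struct tm already expressed in UTC to Unix epoch seconds
 * using the POSIX "Seconds Since the Epoch" formula, avoiding
 * mktime()'s dependence on the TZ environment variable.
 */
static long long
tm_to_epoch_utc(const struct tm *tm)
{
	return tm->tm_sec + tm->tm_min * 60 + tm->tm_hour * 3600 +
		tm->tm_yday * 86400LL + (tm->tm_year - 70) * 31536000LL +
		((tm->tm_year - 69) / 4) * 86400LL -
		((tm->tm_year - 1) / 100) * 86400LL +
		((tm->tm_year + 299) / 400) * 86400LL;
}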
Re: Parallel copy
Hello, I was going through some literatures on parsing CSV files in a fully parallelized way and found (from [1]) an interesting approach implemented in the open-source project ParaText[2]. The algorithm follows a two-phase approach: the first pass identifies the adjusted chunks in parallel by exploiting the simplicity of CSV formats and the second phase processes complete records within each adjusted chunk by one of the available workers. Here is the sketch: 1. Each worker scans a distinct fixed sized chunk of the CSV file and collects the following three stats from the chunk: a) number of quotes b) position of the first new line after even number of quotes c) position of the first new line after odd number of quotes 2. Once stats from all the chunks are collected, the leader identifies the adjusted chunk boundaries by iterating over the stats linearly: - For the k-th chunk, the leader adds the number of quotes in k-1 chunks. - If the number is even, then the k-th chunk does not start in the middle of a quoted field, and the first newline after an even number of quotes (the second collected information) is the first record delimiter in this chunk. - Otherwise, if the number is odd, the first newline after an odd number of quotes (the third collected information) is the first record delimiter. - The end position of the adjusted chunk is obtained based on the starting position of the next adjusted chunk. 3. Once the boundaries of the chunks are determined (forming adjusted chunks), individual worker may take up one adjusted chunk and process the tuples independently. Although this approach parses the CSV in parallel, it requires two scan on the CSV file. So, given a system with spinning hard-disk and small RAM, as per my understanding, the algorithm will perform very poorly. But, if we use this algorithm to parse a CSV file on a multi-core system with a large RAM, the performance might be improved significantly [1]. Hence, I was trying to think whether we can leverage this idea for implementing parallel COPY in PG. We can design an algorithm similar to parallel hash-join where the workers pass through different phases. 1. Phase 1 - Read fixed size chunks in parallel, store the chunks and the small stats about each chunk in the shared memory. If the shared memory is full, go to phase 2. 2. Phase 2 - Allow a single worker to process the stats and decide the actual chunk boundaries so that no tuple spans across two different chunks. Go to phase 3. 3. Phase 3 - Each worker picks one adjusted chunk, parse and process tuples from the same. Once done with one chunk, it picks the next one and so on. 4. If there are still some unread contents, go back to phase 1. We can probably use separate workers for phase 1 and phase 3 so that they can work concurrently. Advantages: 1. Each worker spends some significant time in each phase. Gets benefit of the instruction cache - at least in phase 1. 2. It also has the same advantage of parallel hash join - fast workers get to work more. 3. We can extend this solution for reading data from STDIN. Of course, the phase 1 and phase 2 must be performed by the leader process who can read from the socket. Disadvantages: 1. Surely doesn't work if we don't have enough shared memory. 2. Probably, this approach is just impractical for PG due to certain limitations. Thoughts? [1] https://www.microsoft.com/en-us/research/uploads/prod/2019/04/chunker-sigmod19.pdf [2] ParaText. https://github.com/wiseio/paratext. 
-- Thanks & Regards, Kuntal Ghosh EnterpriseDB: http://www.enterprisedb.com
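To make phases 1 and 2 of the chunking approach above concrete, here is an illustrative sketch (all names invented; it also ignores configurable quote/escape characters, embedded carriage returns, and similar details):

/* Per-chunk statistics collected in phase 1. */
typedef struct ChunkStats
{
	long		nquotes;		/* (a) number of quote characters */
	long		first_nl_even;	/* (b) offset of first newline after an
								 * even number of quotes, or -1 */
	long		first_nl_odd;	/* (c) offset of first newline after an
								 * odd number of quotes, or -1 */
} ChunkStats;

/* Phase 1: each worker scans its fixed-size chunk independently. */
static void
scan_chunk(const char *buf, long len, ChunkStats *stats)
{
	long		nquotes = 0;
	long		i;

	stats->first_nl_even = -1;
	stats->first_nl_odd = -1;

	for (i = 0; i < len; i++)
	{
		if (buf[i] == '"')
			nquotes++;
		else if (buf[i] == '\n')
		{
			if (nquotes % 2 == 0 && stats->first_nl_even < 0)
				stats->first_nl_even = i;
			else if (nquotes % 2 == 1 && stats->first_nl_odd < 0)
				stats->first_nl_odd = i;
		}
	}
	stats->nquotes = nquotes;
}

/*
 * Phase 2: the leader walks the stats once.  Chunk k starts inside a
 * quoted field iff the total number of quotes in chunks 0..k-1 is odd,
 * which decides which newline is its first record delimiter; the
 * adjusted chunk then runs until the next chunk's adjusted start.
 */
static void
compute_adjusted_starts(const ChunkStats *stats, int nchunks,
						long chunk_size, long *adjusted_start)
{
	long		quotes_before = 0;
	int			k;

	for (k = 0; k < nchunks; k++)
	{
		long		nl = (quotes_before % 2 == 0) ?
			stats[k].first_nl_even : stats[k].first_nl_odd;

		if (k == 0)
			adjusted_start[k] = 0;	/* the file's first record starts here */
		else
			adjusted_start[k] = (nl >= 0) ? k * chunk_size + nl + 1 : -1;

		quotes_before += stats[k].nquotes;
	}
}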
Re: documenting the backup manifest file format
On 4/14/20 3:19 PM, David Steele wrote: > On 4/14/20 3:03 PM, Andrew Dunstan wrote: >> >> On 4/14/20 1:33 PM, David Steele wrote: >>> On 4/14/20 1:27 PM, Alvaro Herrera wrote: On 2020-Apr-14, David Steele wrote: > On 4/14/20 12:56 PM, Robert Haas wrote: > >> Hmm, did David suggest that before? I don't recall for sure. I think >> he had some suggestion, but I'm not sure if it was the same one. > > "I'm also partial to using epoch time in the manifest because it is > generally easier for programs to work with. But, human-readable > doesn't > suck, either." Ugh. If you go down that road, why write human-readable contents at all? You may as well just use a binary format. But that's a very slippery slope and you won't like to be in the bottom -- I don't see what that gains you. It's not like it's a lot of work to parse a timestamp in a non-internationalized well-defined human-readable format. >>> >>> Well, times are a special case because they are so easy to mess up. >>> Try converting ISO-8601 to epoch time using the standard C functions >>> on a system where TZ != UTC. Fun times. >> >> Even if it's a zulu time? That would be pretty damn sad. > ZULU/GMT/UTC are all fine. But if the server timezone is EDT for > example (not that I recommend this) you are likely to get the wrong > result. Results vary based on your platform. For instance, we found > MacOS was more likely to work the way you would expect and Linux was > hopeless. > > There are all kinds of fun tricks to get around this (sort of). One is > to temporarily set TZ=UTC which sucks if an error happens before it > gets set back. There are some hacks to try to determine your offset > which have inherent race conditions around DST changes. > > After some experimentation we just used the Posix definition for epoch > time and used that to do our conversions: > > https://pubs.opengroup.org/onlinepubs/9699919799/basedefs/V1_chap04.html#tag_04_16 > > > OK, but I think if we're putting a timestamp string in ISO-8601 format in the manifest it should be in UTC / Zulu time, precisely to avoid these issues. If that's too much trouble then yes an epoch time will probably do. cheers andrew -- Andrew Dunstanhttps://www.2ndQuadrant.com PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Re: documenting the backup manifest file format
On 4/14/20 3:55 PM, Andrew Dunstan wrote: OK, but I think if we're putting a timestamp string in ISO-8601 format in the manifest it should be in UTC / Zulu time, precisely to avoid these issues. If that's too much trouble then yes an epoch time will probably do. Happily ISO-8601 is always UTC. The problem I'm referring to is the timezone setting on the host system when doing conversions in C. To be fair most languages handle this well and C is C so I'm not sure we need to make a big deal of it. In JSON/XML it's pretty common to use ISO-8601 so that seems like a rational choice. Regards, -- -David da...@pgmasters.net
Re: documenting the backup manifest file format
On 2020-Apr-14, Andrew Dunstan wrote: > OK, but I think if we're putting a timestamp string in ISO-8601 format > in the manifest it should be in UTC / Zulu time, precisely to avoid > these issues. If that's too much trouble then yes an epoch time will > probably do. The timestamp is always specified and always UTC (except the code calls it GMT). + /* +* Convert last modification time to a string and append it to the +* manifest. Since it's not clear what time zone to use and since time +* zone definitions can change, possibly causing confusion, use GMT +* always. +*/ + appendStringInfoString(&buf, "\"Last-Modified\": \""); + enlargeStringInfo(&buf, 128); + buf.len += pg_strftime(&buf.data[buf.len], 128, "%Y-%m-%d %H:%M:%S %Z", + pg_gmtime(&mtime)); + appendStringInfoString(&buf, "\""); I was merely saying that it's trivial to make this iso-8601 compliant as buf.len += pg_strftime(&buf.data[buf.len], 128, "%Y-%m-%dT%H:%M:%SZ", ie. omit the "GMT" string and replace it with a literal Z, and remove the space and replace it with a T. -- Álvaro Herrerahttps://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Re: documenting the backup manifest file format
On 2020-Apr-14, David Steele wrote: > Happily ISO-8601 is always UTC. Uh, it is not -- https://en.wikipedia.org/wiki/ISO_8601#Time_zone_designators -- Álvaro Herrerahttps://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Re: Race condition in SyncRepGetSyncStandbysPriority
I wrote: > It doesn't seem to me to be that hard to implement the desired > semantics for synchronous_standby_names with inconsistent info. > In FIRST mode you basically just need to take the N smallest > priorities you see in the array, but without assuming there are no > duplicates or holes. It might be a good idea to include ties at the > end, that is if you see 1,2,2,4 or 1,3,3,4 and you want 2 sync > standbys, include the first three of them in the calculation until > the inconsistency is resolved. In ANY mode I don't see that > inconsistent priorities matter at all. Concretely, I think we ought to do the attached, or something pretty close to it. I'm not really happy about breaking ties based on walsnd_index, but I see that there are several TAP test cases that fail if we do something else. I'm inclined to think those tests are bogus ... but I won't argue to change them right now. regards, tom lane diff --git a/src/backend/replication/syncrep.c b/src/backend/replication/syncrep.c index ffd5b31..c66c371 100644 --- a/src/backend/replication/syncrep.c +++ b/src/backend/replication/syncrep.c @@ -108,14 +108,17 @@ static bool SyncRepGetSyncRecPtr(XLogRecPtr *writePtr, static void SyncRepGetOldestSyncRecPtr(XLogRecPtr *writePtr, XLogRecPtr *flushPtr, XLogRecPtr *applyPtr, - List *sync_standbys); + SyncRepStandbyData *sync_standbys, + int num_standbys); static void SyncRepGetNthLatestSyncRecPtr(XLogRecPtr *writePtr, XLogRecPtr *flushPtr, XLogRecPtr *applyPtr, - List *sync_standbys, uint8 nth); + SyncRepStandbyData *sync_standbys, + int num_standbys, + uint8 nth); static int SyncRepGetStandbyPriority(void); -static List *SyncRepGetSyncStandbysPriority(bool *am_sync); -static List *SyncRepGetSyncStandbysQuorum(bool *am_sync); +static void SyncRepGetSyncStandbysPriority(SyncRepStandbyData *standbys, int n); +static int standby_priority_comparator(const void *a, const void *b); static int cmp_lsn(const void *a, const void *b); #ifdef USE_ASSERT_CHECKING @@ -406,9 +409,10 @@ SyncRepInitConfig(void) priority = SyncRepGetStandbyPriority(); if (MyWalSnd->sync_standby_priority != priority) { - LWLockAcquire(SyncRepLock, LW_EXCLUSIVE); + SpinLockAcquire(&MyWalSnd->mutex); MyWalSnd->sync_standby_priority = priority; - LWLockRelease(SyncRepLock); + SpinLockRelease(&MyWalSnd->mutex); + ereport(DEBUG1, (errmsg("standby \"%s\" now has synchronous standby priority %u", application_name, priority))); @@ -523,8 +527,6 @@ SyncRepReleaseWaiters(void) /* * Calculate the synced Write, Flush and Apply positions among sync standbys. * - * The caller must hold SyncRepLock. - * * Return false if the number of sync standbys is less than * synchronous_standby_names specifies. Otherwise return true and * store the positions into *writePtr, *flushPtr and *applyPtr. @@ -536,27 +538,43 @@ static bool SyncRepGetSyncRecPtr(XLogRecPtr *writePtr, XLogRecPtr *flushPtr, XLogRecPtr *applyPtr, bool *am_sync) { - List *sync_standbys; - - Assert(LWLockHeldByMe(SyncRepLock)); + SyncRepStandbyData *sync_standbys; + int num_standbys; + int i; + /* Initialize default results */ *writePtr = InvalidXLogRecPtr; *flushPtr = InvalidXLogRecPtr; *applyPtr = InvalidXLogRecPtr; *am_sync = false; + /* Quick out if not even configured to be synchronous */ + if (SyncRepConfig == NULL) + return false; + /* Get standbys that are considered as synchronous at this moment */ - sync_standbys = SyncRepGetSyncStandbys(am_sync); + num_standbys = SyncRepGetSyncStandbys(&sync_standbys); + + /* Am I among the candidate sync standbys? 
*/ + for (i = 0; i < num_standbys; i++) + { + if (sync_standbys[i].is_me) + { + *am_sync = sync_standbys[i].is_sync_standby; + break; + } + } /* - * Quick exit if we are not managing a sync standby or there are not - * enough synchronous standbys. + * Nothing more to do if we are not managing a sync standby or there are + * not enough synchronous standbys. (Note: if there are least num_sync + * candidates, then at least num_sync of them will be marked as + * is_sync_standby; we don't need to count them here.) */ if (!(*am_sync) || - SyncRepConfig == NULL || - list_length(sync_standbys) < SyncRepConfig->num_sync) + num_standbys < SyncRepConfig->num_sync) { - list_free(sync_standbys); + pfree(sync_standbys); return false; } @@ -576,15 +594,16 @@ SyncRepGetSyncRecPtr(XLogRecPtr *writePtr, XLogRecPtr *flushPtr, if (SyncRepConfig->syncrep_method == SYNC_REP_PRIORITY) { SyncRepGetOldestSyncRecPtr(writePtr, flushPtr, applyPtr, - sync_standbys); + sync_standbys, num_standbys); } else { SyncRepGetNthLatestSyncRecPtr(writePtr, flushPtr, applyPtr, - sync_standbys, SyncRepConfig->num_sync); + sync_standbys, num_standbys, + S
Re: documenting the backup manifest file format
On 4/14/20 4:09 PM, Alvaro Herrera wrote: > On 2020-Apr-14, Andrew Dunstan wrote: > >> OK, but I think if we're putting a timestamp string in ISO-8601 format >> in the manifest it should be in UTC / Zulu time, precisely to avoid >> these issues. If that's too much trouble then yes an epoch time will >> probably do. > The timestamp is always specified and always UTC (except the code calls > it GMT). > > + /* > +* Convert last modification time to a string and append it to the > +* manifest. Since it's not clear what time zone to use and since time > +* zone definitions can change, possibly causing confusion, use GMT > +* always. > +*/ > + appendStringInfoString(&buf, "\"Last-Modified\": \""); > + enlargeStringInfo(&buf, 128); > + buf.len += pg_strftime(&buf.data[buf.len], 128, "%Y-%m-%d %H:%M:%S %Z", > + pg_gmtime(&mtime)); > + appendStringInfoString(&buf, "\""); > > I was merely saying that it's trivial to make this iso-8601 compliant as > > buf.len += pg_strftime(&buf.data[buf.len], 128, "%Y-%m-%dT%H:%M:%SZ", > > ie. omit the "GMT" string and replace it with a literal Z, and remove > the space and replace it with a T. > +1 cheers andre -- Andrew Dunstanhttps://www.2ndQuadrant.com PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Re: documenting the backup manifest file format
On 4/14/20 4:11 PM, Alvaro Herrera wrote: On 2020-Apr-14, David Steele wrote: Happily ISO-8601 is always UTC. Uh, it is not -- https://en.wikipedia.org/wiki/ISO_8601#Time_zone_designators Whoops, you are correct. I've just never seen non-UTC in the wild yet. -- -David da...@pgmasters.net
Re: cleaning perl code
On 2020-Apr-14, Andrew Dunstan wrote: > One of the things that's a bit sad is that perlcritic doesn't generally > let you apply policies to a given set of files or files matching some > pattern. It would be nice, for instance, to be able to apply some > additional standards to strategic library files like PostgresNode.pm, > TestLib.pm and Catalog.pm. There are good reasons as suggested upthread > to apply higher standards to library files than to, say, a TAP test > script. The only easy way I can see to do that would be to have two > different perlcriticrc files and adjust pgperlcritic to make two runs. > If people think that's worth it I'll put a little work into it. If not, > I'll just leave things here. I think being more strict about it in strategic files (I'd say that's Catalog.pm plus src/test/perl/*.pm) might be a good idea. Maybe give it a try and see what comes up. -- Álvaro Herrerahttps://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Re: WIP/PoC for parallel backup
On Tue, Apr 14, 2020 at 10:37 AM Asif Rehman wrote: > I forgot to make a check for no-manifest. Fixed. Attached is the updated > patch. +typedef struct +{ ... +} BackupFile; + +typedef struct +{ ... +} BackupState; These structures need comments. +list_wal_files_opt_list: + SCONST SCONST { - $$ = makeDefElem("manifest_checksums", - (Node *)makeString($2), -1); + $$ = list_make2( + makeDefElem("start_wal_location", + (Node *)makeString($2), -1), + makeDefElem("end_wal_location", + (Node *)makeString($2), -1)); + } This seems like an unnecessarily complicated parse representation. The DefElems seem to be completely unnecessary here. @@ -998,7 +1110,37 @@ SendBaseBackup(BaseBackupCmd *cmd) set_ps_display(activitymsg); } - perform_base_backup(&opt); + switch (cmd->cmdtag) So the design here is that SendBaseBackup() is now going to do a bunch of things that are NOT sending a base backup? With no updates to the comments of that function and no change to the process title it sets? - return (manifest->buffile != NULL); + return (manifest && manifest->buffile != NULL); Heck no. It appears that you didn't even bother reading the function header comment. + * Send a single resultset containing XLogRecPtr record (in text format) + * TimelineID and backup label. */ static void -SendXlogRecPtrResult(XLogRecPtr ptr, TimeLineID tli) +SendXlogRecPtrResult(XLogRecPtr ptr, TimeLineID tli, +StringInfo label, char *backupid) This just casually breaks wire protocol compatibility, which seems completely unacceptable. + if (strlen(opt->tablespace) > 0) + sendTablespace(opt->tablespace, NULL, true, NULL, &files); + else + sendDir(".", 1, true, NIL, true, NULL, NULL, &files); + + SendFilesHeader(files); So I guess the idea here is that we buffer the entire list of files in memory, regardless of size, and then we send it out afterwards. That doesn't seem like a good idea. The list of files might be very large. We probably need some code refactoring here rather than just piling more and more different responsibilities onto sendTablespace() and sendDir(). + if (state->parallel_mode) + SpinLockAcquire(&state->lock); + + state->throttling_counter += increment; + + if (state->parallel_mode) + SpinLockRelease(&state->lock); I don't like this much. It seems to me that we would do better to use atomics here all the time, instead of conditional spinlocks. +static void +send_file(basebackup_options *opt, char *file, bool missing_ok) ... + if (file == NULL) + return; That seems totally inappropriate. + sendFile(file, file + basepathlen, &statbuf, true, InvalidOid, NULL, NULL); Maybe I'm misunderstanding, but this looks like it's going to write a tar header, even though we're not writing a tarfile. + else + ereport(WARNING, + (errmsg("skipping special file or directory \"%s\"", file))); So, if the user asks for a directory or symlink, what's going to happen is that they're going to receive an empty file, and get a warning. That sounds like terrible behavior. + /* +* Check for checksum failures. If there are failures across multiple +* processes it may not report total checksum count, but it will error +* out,terminating the backup. +*/ In other words, the patch breaks the feature. Not that the feature in question works particularly well as things stand, but this makes it worse. I think this patch (0003) is in really bad shape. 
I'm having second thoughts about the design, but it's kind of hard to even have a discussion about the design when the patch is riddled with minor problems like inadequate comments, failure to update existing comments, and breaking a bunch of things. I understand that sometimes things get missed, but this is version 14 of a patch that's been kicking around since last August. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
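As one illustration of the atomics point above (struct and field names invented, not from the patch), the throttling counter could be a pg_atomic_uint64 that is updated identically in serial and parallel mode, with no conditional spinlock:

#include "postgres.h"
#include "port/atomics.h"

typedef struct ThrottlingState
{
	pg_atomic_uint64 throttling_counter;
} ThrottlingState;

/* Call once while setting up the (possibly shared) state. */
static void
throttling_state_init(ThrottlingState *state)
{
	pg_atomic_init_u64(&state->throttling_counter, 0);
}

/* Safe whether or not the backup is running in parallel mode. */
static void
throttling_counter_add(ThrottlingState *state, uint64 increment)
{
	pg_atomic_fetch_add_u64(&state->throttling_counter, increment);
}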
Re: ERROR: could not determine which collation to use for string comparison
Andreas Joseph Krogh writes: > Guys; This errors out with: > ERROR: could not determine which collation to use for string comparison > HINT: Use the COLLATE clause to set the collation explicitly. Fixed, thanks for the report. regards, tom lane
Re: Poll: are people okay with function/operator table redesign?
On Mon, Apr 13, 2020 at 10:13 AM Tom Lane wrote: > As discussed in the thread at [1], I've been working on redesigning > the tables we use to present SQL functions and operators. The > first installment of that is now up; see tables 9.30 and 9.31 at > > https://www.postgresql.org/docs/devel/functions-datetime.html > > and table 9.33 at > > https://www.postgresql.org/docs/devel/functions-enum.html > > As I write this the enum headers are centered horizontally while the datetime ones are left aligned. The centering doesn't do it for me. Too much gap, and the data itself is not centered, so there is a large disconnect between the header and the value. The run-on aspect of the left-aligned setup is of some concern but maybe just adding some left padding to the second column - and right padding to the first - can provide the desired negative space without adding so much as to break usability. (gonna use embedded images here...) [image: image.png] [image: image.png] David J.
Re: wrong relkind error messages
On 2020-Apr-14, Michael Paquier wrote: > On Mon, Apr 13, 2020 at 11:13:15AM -0400, Tom Lane wrote: > > ERROR: cannot define statistics for relation "ti" > > DETAIL: This operation is not supported for indexes. > > > > which still leaves implicit that "ti" is an index, but probably that's > > something the user can figure out. > > > > Maybe someone else can do better? > > "This operation is not supported for put_relkind_here \"%s\"."? I > think that it is better to provide a relation name in the error > message (even optionally a namespace). That's less to guess for the > user. But the relation name is already in the ERROR line -- why do you care so much about also having it in the DETAIL? Besides, I think part of the point Tom was making is that if you say "not supported for the index foo" is that the user is left wondering whether the operation is not supported for that particular index only or for any index. Tom's other proposal > > DETAIL: "ti" is an index, and this operation is not supported for that > > kind of relation. addresses that problem, but seems excessively verbose. Also, elsewhere Peter said[1] that we should not try to list the things that would be allowed, so it's pointless to try to list the relkinds for which the operation is permissible. So I +1 this idea: ERROR: cannot define statistics for relation "ti" DETAIL: This operation is not supported for indexes. [1] https://www.postgresql.org/message-id/1293803569.19789.6.camel%40fsopti579.F-Secure.com -- Álvaro Herrerahttps://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Re: pg_basebackup, manifests and backends older than ~12
On Tue, Apr 14, 2020 at 03:13:39PM -0400, Robert Haas wrote: > Feel free to go ahead. Thanks, let's do it then. If you have any objections about any parts of the patch, of course please feel free. -- Michael
Re: pg_basebackup, manifests and backends older than ~12
On Mon, Apr 13, 2020 at 07:04:20PM -0400, Alvaro Herrera wrote: > I agree, I think forcing users to specify --no-manifest when run on old > servers will cause users to write bad scripts; I vote for silently > disabling checksums. Okay, thanks. Are there any other opinions? -- Michael
Re: wrong relkind error messages
On 2020-Apr-13, Robert Haas wrote: > + ereport(ERROR, > + (errcode(ERRCODE_WRONG_OBJECT_TYPE), > + errmsg("action cannot be performed on relation \"%s\"", > + RelationGetRelationName(rel)), > > Super-vague. Maybe, but note that the patch proposed to replace this current error message: ERROR: foo is not an index or foreign table with ERROR: action cannot be performed on "foo" DETAIL: "foo" is a materialized view. or, if we're to adopt Tom's proposed wording, ERROR: cannot perform action on relation "ti" DETAIL: This operation is not supported for materialized views. so it's not like this is making things any worse; the error was already super-vague. This could be improved if we had stringification of ALTER TABLE subcommand types: ERROR: ALTER TABLE ... ADD COLUMN cannot be performed on "foo" DETAIL: "foo" is a gummy bear. or ERROR: ALTER TABLE ... ADD COLUMN cannot be performed on foo DETAIL: This action cannot be performed on gummy bears. but that seems material for a different patch. -- Álvaro Herrerahttps://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Re: pg_basebackup, manifests and backends older than ~12
Michael Paquier writes: > On Mon, Apr 13, 2020 at 07:04:20PM -0400, Alvaro Herrera wrote: >> I agree, I think forcing users to specify --no-manifest when run on old >> servers will cause users to write bad scripts; I vote for silently >> disabling checksums. > Okay, thanks. Are there any other opinions? FWIW, I concur with silently disabling the feature if the source server can't support it. regards, tom lane
Re: wrong relkind error messages
Alvaro Herrera writes: > On 2020-Apr-13, Robert Haas wrote: >> + ereport(ERROR, >> + (errcode(ERRCODE_WRONG_OBJECT_TYPE), >> + errmsg("action cannot be performed on relation \"%s\"", >> + RelationGetRelationName(rel)), >> >> Super-vague. > Maybe, but note that the patch proposed to replace this current error > message: > ERROR: foo is not an index or foreign table > ... > so it's not like this is making things any worse; the error was already > super-vague. Yeah. I share Robert's feeling that "action" is not really desirable here, but I have to concur that this is an improvement on the existing text, which also fails to mention what command is being rejected. > This could be improved if we had stringification of ALTER TABLE > subcommand types: > ERROR: ALTER TABLE ... ADD COLUMN cannot be performed on "foo" In the meantime could we at least say "ALTER TABLE action cannot be performed"? The worst aspect of the existing text is that if an error comes out of a script with a lot of different commands, it doesn't give you any hint at all about which command failed. regards, tom lane
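Pulling the subthread together, the shape being converged on would look roughly like this (illustrative only; relkind_phrase() is a hypothetical helper mapping a relkind to wording such as "indexes" or "materialized views", and no such helper exists yet):

#include "postgres.h"
#include "utils/rel.h"

/* hypothetical: returns e.g. "indexes" for RELKIND_INDEX */
extern const char *relkind_phrase(char relkind);

static void
reject_wrong_relkind(Relation rel)
{
	ereport(ERROR,
			(errcode(ERRCODE_WRONG_OBJECT_TYPE),
			 errmsg("ALTER TABLE action cannot be performed on relation \"%s\"",
					RelationGetRelationName(rel)),
			 errdetail("This operation is not supported for %s.",
					   relkind_phrase(rel->rd_rel->relkind))));
}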
Re: [PATCH] Keeps tracking the uniqueness with UniqueKey
Hi David: Thanks for your time. > 1. Out of date comment in join.sql > > -- join removal is not possible when the GROUP BY contains a column that is > -- not in the join condition. (Note: as of 9.6, we notice that b.id is a > -- primary key and so drop b.c_id from the GROUP BY of the resulting plan; > -- but this happens too late for join removal in the outer plan level.) > explain (costs off) > select d.* from d left join (select d, c_id from b group by b.d, b.c_id) s > on d.a = s.d; > > You've changed the GROUP BY clause so it does not include b.id, so the > Note in the comment is now misleading. > Thanks, I will fix this one in the following patch. > > 2. I think 0002 is overly restrictive in its demands that > parse->hasAggs must be false. We should be able to just use a Group > Aggregate with unsorted input when the input_rel is unique on the > GROUP BY clause. This will save on hashing and sorting. Basically > similar to what we do for when a query contains aggregates without any > GROUP BY. > > Yes, This will be a perfect result, the difficult is the current aggregation function execution is highly coupled with Agg node(ExecInitAgg) which is removed in the unique case. I ever make the sum (w/o finalfn) and avg(with finalfn) works in a hack way, but still many stuffs is not handled. Let me prepare the code for this purpose in 1~2 days to see if I'm going with the right direction. Ashutosh also has an idea[1] that if the relation underlying an Agg node is known to be unique for given groupByClause, we could safely use AGG_SORTED strategy. Though the input is not ordered, it's sorted thus for every row Agg node will combine/finalize the aggregate result. I will target the perfect result first and see how many effort do we need, if not, I will try Ashutosh's suggestion. > 3. I don't quite understand why you changed this to a right join: > > -- Test case where t1 can be optimized but not t2 > explain (costs off) select t1.*,t2.x,t2.z > -from t1 inner join t2 on t1.a = t2.x and t1.b = t2.y > +from t1 right join t2 on t1.a = t2.x and t1.b = t2.y > > Perhaps this change is left over from some previous version of the patch? > This is on purpose. the original test case is used to test we can short the group key for t1 but not t2 for aggregation, but if I keep the inner join, the aggnode will be removed totally, so I have to change it to right join in order to keep the aggnode. The full test case is: -- Test case where t1 can be optimized but not t2 explain (costs off) select t1.*,t2.x,t2.z from t1 inner join t2 on t1.a = t2.x and t1.b = t2.y group by t1.a,t1.b,t1.c,t1.d,t2.x,t2.z; where (a, b) is the primary key of t1. [1] https://www.postgresql.org/message-id/CAExHW5sY%2BL6iZ%3DrwnL7n3jET7aNLCNQimvfcS7C%2B5wmdjmdPiw%40mail.gmail.com
Re: Perl modules for testing/viewing/corrupting/repairing your heap files
Not having received any feedback on this, I've dusted the modules off for submission as-is. v1-0001-Adding-HeapFile-related-perl-modules.patch Description: Binary data — Mark Dilger EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: Perl modules for testing/viewing/corrupting/repairing your heap files
On Wed, Apr 8, 2020 at 3:51 PM Mark Dilger wrote: > Recently, as part of testing something else, I had need of a tool to create > surgically precise corruption within heap pages. I wanted to make the > corruption from within TAP tests, so I wrote the tool as a set of perl > modules. There is also pg_hexedit: https://github.com/petergeoghegan/pg_hexedit -- Peter Geoghegan
Re: Race condition in SyncRepGetSyncStandbysPriority
At Tue, 14 Apr 2020 09:52:42 -0400, Tom Lane wrote in > Kyotaro Horiguchi writes: > > SyncRepGetSyncStandbysPriority() is runing holding SyncRepLock so > > sync_standby_priority of any walsender can be changed while the > > function is scanning welsenders. The issue is we already have > > inconsistent walsender information before we enter the function. Thus > > how many times we scan on the array doesn't make any difference. > > *Yes it does*. The existing code can deliver entirely broken results > if some walsender exits between where we examine the priorities and > where we fetch the WAL pointers. While that doesn't seem to be the Ah. I didn't take that as inconsistency. Actually walsender exit inactivate the corresponding slot by setting pid = 0. In a bad case (as you mentioned upthread) the entry can be occupied by another wal sender. However, sync_standby_priority cannot be updated until the whole work is finished. > exact issue we're seeing in the buildfarm, it's still another obvious > bug in this code. I will not accept a "fix" that doesn't fix that. I think that the "inconsistency" that can be observed in a process is disagreement between SyncRepConfig->nmembers and ->sync_standby_priority. If any one of walsenders regards its priority as lower (larger in value) than nmembers in the "current" process, the assertion fires. If that is the issue, the issue is not dynamic inconsistency. # It's the assumption of my band-aid. > > I think we need to do one of the followings. > > > A) prevent SyncRepGetSyncStandbysPriority from being entered while > > walsender priority is inconsistent. > > B) make SyncRepGetSyncStandbysPriority be tolerant of priority > > inconsistency. > > C) protect walsender priority array from beinig inconsistent. > > (B) seems like the only practical solution from here. We could > probably arrange for synchronous update of the priorities when > they change in response to a GUC change, but it doesn't seem to > me to be practical to do that in response to walsender exit. > You'd end up finding that an unexpected walsender exit results > in panic'ing the system, which is no better than where we are now. I agree to you as a whole. I thought of several ways to keep the consistency of the array but all of them looked too much. > It doesn't seem to me to be that hard to implement the desired > semantics for synchronous_standby_names with inconsistent info. > In FIRST mode you basically just need to take the N smallest > priorities you see in the array, but without assuming there are no > duplicates or holes. It might be a good idea to include ties at the > end, that is if you see 1,2,2,4 or 1,3,3,4 and you want 2 sync > standbys, include the first three of them in the calculation until > the inconsistency is resolved. In ANY mode I don't see that > inconsistent priorities matter at all. Mmm, the priority lists like 1,2,2,4 are not thought as inconsistency at all in the context of walsender priority. That happenes stablly if any two or more walreceivers reported the same application_name. I believe the existing code is already taking that case into consideration. > > If we accept to share variable-length information among processes, > > sharing sync_standby_names or parsed SyncRepConfigData among processes > > would work. > > Not sure that we really need more than what's being shared now, > ie each process's last-known index in the sync_standby_names list. If we take the (B), we don't need anymore. (A) and (C) would need more. regards. 
-- Kyotaro Horiguchi NTT Open Source Software Center
remove_useless_groupby_columns does not need to record constraint dependencies
Hi, Over in [1], Tom and I had a discussion in response to some confusion about why remove_useless_groupby_columns() goes to the trouble of recording a dependency on the PRIMARY KEY constraint when removing surplus columns from the GROUP BY clause. The outcome was that we don't need to do this since remove_useless_groupby_columns() is used only as a plan-time optimisation, we don't need to record any dependency. Unlike check_functional_grouping(), where we must record the dependency as we may end up with a VIEW with columns, e.g, in the select list which are functionally dependant on a pkey constraint. In that case, we must ensure the view is also removed, or that the constraint removal is blocked. There's no such requirement for planner smarts, such as the one in remove_useless_groupby_columns() as in that case we'll trigger a relcache invalidation during ALTER TABLE DROP CONSTRAINT, which cached plans will notice when they obtain their locks just before execution begins. To prevent future confusion, I'd like to remove dependency recording code from remove_useless_groupby_columns() and update the misleading comment. Likely this should also be backpatched to 9.6. Does anyone think differently? A patch to do this is attached. [1] https://www.postgresql.org/message-id/caaphdvr4ow_oud_rxp0d1hrgz+a4mm8+8ur7qom2vqkfx08...@mail.gmail.com fix_remove_useless_groupby_columns.patch Description: Binary data
Re: where should I stick that backup?
Hi, On 2020-04-14 11:38:03 -0400, Robert Haas wrote: > I'm fairly deeply uncomfortable with what Andres is proposing. I see > that it's very powerful, and can do a lot of things, and that if > you're building something that does sophisticated things with storage, > you probably want an API like that. It does a great job making > complicated things possible. However, I feel that it does a lousy job > making simple things simple. I think it's pretty much exactly the opposite. Your approach seems to move all the complexity to the user, having to build entire combination of commands themselves. Instead of having one or two default commands that do backups in common situations, everyone has to assemble them from pieces. Moved from later in your email, since it seems to make more sense to have it here: > All they're going to see is that they can use gzip and maybe lz4 > because we provide the necessary special magic tools to integrate with > those, but for some reason we don't have a special magic tool that > they can use with their own favorite compressor, and so they can't use > it. I think people are going to find that fairly unhelpful. I have no problem with providing people with the opportunity to use their personal favorite compressor, but forcing them to have to do that, and to ensure it's installed etc, strikes me as a spectacurly bad default situation. Most people don't have the time to research which compression algorithms work the best for which precise situation. How do you imagine a default scripted invocation of the new backup stuff to look like? Having to specify multiple commandline "fragments" for compression, storing files, ... can't be what we want the common case should look like. It'll just again lead to everyone copy & pasting examples that all are wrong in different ways. They'll not at all work across platforms (or often not across OS versions). In general, I think it's good to give expert users the ability to customize things like backups and archiving. But defaulting to every non-expert user having to all that expert work (or coyping it from bad examples) is one of the most user hostile things in postgres. > Also, I don't really see what's wrong with the server forking > processes that exec("/usr/bin/lz4") or whatever. We do similar things > in other places and, while it won't work for cases where you want to > compress a shazillion files, that's not really a problem here anyway. > At least at the moment, the server-side format is *always* tar, so the > problem of needing a separate subprocess for every file in the data > directory does not arise. I really really don't understand this. Are you suggesting that for server side compression etc we're going to add the ability to specify shell commands as argument to the base backup command? That seems so obviously a non-starter? A good default for backup configurations should be that the PG user that the backup is done under is only allowed to do that, and not that it directly has arbitrary remote command execution. > Suppose you want to compress using your favorite compression > program. Well, you can't. Your favorite compression program doesn't > speak the bespoke PostgreSQL protocol required for backup > plugins. Neither does your favorite encryption program. Either would > be perfectly happy to accept a tarfile on stdin and dump out a > compressed or encrypted version, as the case may be, on stdout, but > sorry, no such luck. 
You need a special program that speaks the magic > PostgreSQL protocol but otherwise does pretty much the exact same > thing as the standard one. But the tool speaking the protocol can just allow piping through whatever tool? Given that there likely are benefits to either doing things on the client side or on the server side, it seems inevitable that there are multiple places where it would make sense to have the capability? > It's possibly not the exact same thing. A special tool might, for example, > use multiple threads for parallel compression rather than multiple > processes, perhaps gaining a bit of efficiency. But it's doubtful > whether all users care about such marginal improvements. Marginal improvements? Compression scales decently well with the number of cores. pg_basebackup's compression is useless because it's so slow (and because it's client-side, but that's IME the lesser issue). I feel I must be misunderstanding what you mean here. gzip - vs pigz -p $numcores on my machine: 180MB/s vs 2.5GB/s. The latter will still sometimes be a bottleneck (it's a bottleneck in pigz, not available compression cycles), but a lot less commonly than at 180MB/s. Greetings, Andres Freund
Re: backup manifests
On 2020/04/14 0:15, Robert Haas wrote: On Sun, Apr 12, 2020 at 10:09 PM Fujii Masao wrote: I found other minor issues. I think these are all correct fixes. Thanks for the post-commit review, and sorry for this mistakes. Thanks for the review, Michael and Robert. Pushed the patches! Regards, -- Fujii Masao Advanced Computing Technology Center Research and Development Headquarters NTT DATA CORPORATION
Re: snapshot too old issues, first around wraparound and then more.
On Mon, Apr 13, 2020 at 2:58 PM Thomas Munro wrote: > On Fri, Apr 3, 2020 at 2:22 PM Peter Geoghegan wrote: > > I think that it's worth considering whether or not there are a > > significant number of "snapshot too old" users that rarely or never > > rely on old snapshots used by new queries. Kevin said that this > > happens "in some cases", but how many cases? Might it be that many > > "snapshot too old" users could get by with a version of the feature > > that makes the most conservative possible assumptions, totally giving > > up on the idea of differentiating which blocks are truly safe to > > access with an "old" snapshot? (In other words, one that assumes that > > they're *all* unsafe for an "old" snapshot.) > > > > I'm thinking of a version of "snapshot too old" that amounts to a > > statement timeout that gets applied for xmin horizon type purposes in > > the conventional way, while only showing an error to the client if and > > when they access literally any buffer (though not when the relation is > > a system catalog). Is it possible that something along those lines is > > appreciably better than nothing to users? If it is, and if we can find > > a way to manage the transition, then maybe we could tolerate > > supporting this greatly simplified implementation of "snapshot too > > old". > > Interesting idea. I'm keen to try prototyping it to see how well it > works out it practice. Let me know soon if you already have designs > on that and I'll get out of your way, otherwise I'll give it a try and > share what I come up with. Here's a quick and dirty test patch of that idea (or my understanding of it), just for experiments. It introduces snapshot->expire_time and a new timer SNAPSHOT_TIMEOUT to cause the next CHECK_FOR_INTERRUPTS() to set snapshot->too_old on any active or registered snapshots whose time has come, and then try to advance MyPgXact->xmin, without considering the ones marked too old. That gets rid of the concept of "early pruning". You can use just regular pruning, because the snapshot is no longer holding the regular xmin back. Then TestForOldSnapshot() becomes simply if (snapshot->too_old) ereport(...). There are certainly some rough edges, missed details and bugs in here, not least the fact (pointed out to me by Andres in an off-list chat) that we sometimes use short-lived snapshots without registering them; we'd have to fix that. It also does nothing to ensure that TestForOldSnapshot() is actually called at all the right places, which is still required for correct results. If those problems can be fixed, you'd have a situation where snapshot-too-old is a coarse grained, blunt instrument that effectively aborts your transaction even if the whole cluster is read-only. I am not sure if that's really truly useful to anyone (ie if these ODBC cursor users would be satisfied; I'm not sure I understand that use case). Hmm. I suppose it must be possible to put the LSN check back: if (snapshot->too_old && PageGetLSN(page) > snapshot->lsn) ereport(...). Then the granularity would be the same as today -- block level -- but the complexity is transferred from the pruning side (has to deal with xid time map) to the snapshot-owning side (has to deal with timers, CFI() and make sure all snapshots are registered). Maybe not a great deal, and maybe not easier than fixing the existing bugs. One problem is all the new setitimer() syscalls. 
I feel like that could be improved, as could statement_timeout, by letting existing timers run rather than repeatedly rescheduling eagerly, so that e.g. a 1 minute timeout never gets rescheduled more than once per minute. I haven't looked into that, but I guess it's no worse than the existing implementation's overheads anyway. PS: in the patch the GUC is interpreted as milliseconds, which is more fun for testing, but it should really be minutes like before.

From 8a5455c7e376bd6ceddf956f789cfdede0396f3f Mon Sep 17 00:00:00 2001
From: Thomas Munro
Date: Wed, 15 Apr 2020 12:35:51 +1200
Subject: [PATCH] Implement snapshot_too_old using a timer.

Remove the previous implementation of "snapshot too old", and replace it with a very conservative implementation based only on time passing. Use a timer to mark active and registered snapshots as too old and potentially advance the backend's xmin. Snapshots that reach this state can no longer be used to read buffers, because data they need to see may be vacuumed away.

XXX Incomplete experimental code!
---
 src/backend/access/common/toast_internals.c | 12 +-
 src/backend/access/heap/pruneheap.c         |  4 +-
 src/backend/catalog/index.c                 | 15 +-
 src/backend/commands/vacuum.c               |  3 +-
 src/backend/storage/buffer/bufmgr.c         | 17 -
 src/backend/storage/ipc/ipci.c              |  2 -
 src/backend/storage/ipc/procarray.c         | 23 +-
 src/backend/tcop/postgres.c                 |  6 +
 src/backend/utils/init/globals.c            |  1 +
 src/backend/utils/init/postinit.c
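For illustration only (not part of the patch above): a minimal sketch of what the per-buffer check could reduce to under this timer-based design, with the optional LSN test put back in. It assumes the experimental snapshot->too_old field described in the mail, set by the SNAPSHOT_TIMEOUT handler via CHECK_FOR_INTERRUPTS():

    /*
     * Hypothetical sketch, not committed code: reject use of a snapshot that
     * the timer has marked too old, but only if the page has actually been
     * modified since the snapshot was taken and it isn't a catalog page.
     */
    static inline void
    TestForOldSnapshot(Snapshot snapshot, Relation relation, Page page)
    {
        if (old_snapshot_threshold >= 0 &&
            snapshot != NULL &&
            snapshot->too_old &&                /* set by SNAPSHOT_TIMEOUT */
            PageGetLSN(page) > snapshot->lsn &&
            !IsCatalogRelation(relation))
            ereport(ERROR,
                    (errcode(ERRCODE_SNAPSHOT_TOO_OLD),
                     errmsg("snapshot too old")));
    }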
Re: Race condition in SyncRepGetSyncStandbysPriority
At Tue, 14 Apr 2020 16:32:40 -0400, Tom Lane wrote in > I wrote: > > It doesn't seem to me to be that hard to implement the desired > > semantics for synchronous_standby_names with inconsistent info. > > In FIRST mode you basically just need to take the N smallest > > priorities you see in the array, but without assuming there are no > > duplicates or holes. It might be a good idea to include ties at the > > end, that is if you see 1,2,2,4 or 1,3,3,4 and you want 2 sync > > standbys, include the first three of them in the calculation until > > the inconsistency is resolved. In ANY mode I don't see that > > inconsistent priorities matter at all. > > Concretely, I think we ought to do the attached, or something pretty > close to it.

Looking at SyncRepGetSyncStandbys, I agree that it's good not to assume lowest_priority, which I thought was the culprit of the assertion failure. The current code intends to use less memory. I don't think there is a case where only 3 out of 1000 standbys are required to be sync standbys, so collecting all walsenders and then sorting them seems a reasonable strategy. The new code looks clearer.

+ stby->is_sync_standby = true; /* might change below */

I'm uneasy with that. In quorum mode all running standbys are marked as "sync" and that's bogus. The only users of the flag seem to be:

SyncRepGetSyncRecPtr:
+ *am_sync = sync_standbys[i].is_sync_standby;

and SyncRepGetOldestSyncRecPtr:
+ /* Ignore candidates that aren't considered synchronous */
+ if (!sync_standbys[i].is_sync_standby)
+     continue;

On the other hand, sync_standbys is already sorted in priority order, so I think we can get rid of the member by setting *am_sync as follows.

SyncRepGetSyncRecPtr:
    if (sync_standbys[i].is_me)
    {
        *am_sync = (i < SyncRepConfig->num_sync);
        break;
    }

And the second user can be as follows.

SyncRepGetOldestSyncRecPtr:
    /* Ignore candidates that aren't considered synchronous */
    if (i >= SyncRepConfig->num_sync)
        break;

> I'm not really happy about breaking ties based on walsnd_index, > but I see that there are several TAP test cases that fail if we > do something else. I'm inclined to think those tests are bogus ... > but I won't argue to change them right now.

Agreed about the tie-breaker. I'm looking at this more closely. regards. -- Kyotaro Horiguchi NTT Open Source Software Center
Re: [PATCH] Keeps tracking the uniqueness with UniqueKey
On Wed, 15 Apr 2020 at 12:19, Andy Fan wrote: > >> 2. I think 0002 is overly restrictive in its demands that >> parse->hasAggs must be false. We should be able to just use a Group >> Aggregate with unsorted input when the input_rel is unique on the >> GROUP BY clause. This will save on hashing and sorting. Basically >> similar to what we do for when a query contains aggregates without any >> GROUP BY. >> > > Yes, This will be a perfect result, the difficult is the current > aggregation function > execution is highly coupled with Agg node(ExecInitAgg) which is removed in the > unique case.

This case here would be slightly different. It would be handled by still creating a Group Aggregate path, but just not considering Hash Aggregate and not sorting the input to the Group Aggregate path. Perhaps that's best done by creating a new flag bit and using it in create_grouping_paths() in the location where we set the flags variable. If you determine that the input_rel is unique for the GROUP BY clause, and that there are aggregate functions, then set a flag, e.g. GROUPING_INPUT_UNIQUE. Likely there will be a few other flags that you can skip setting in that function; for example, there's no need to check if the input can sort, so no need for GROUPING_CAN_USE_SORT, since you won't need to sort, and likewise for GROUPING_CAN_USE_HASH. I'd say there is also no need for checking if we can set GROUPING_CAN_PARTIAL_AGG (what would be the point in doing partial aggregation when there's 1 row per group?).

Then down in add_paths_to_grouping_rel(), just add a special case before doing any other code, such as:

    if ((extra->flags & GROUPING_INPUT_UNIQUE) != 0 &&
        parse->groupClause != NIL)
    {
        Path   *path = input_rel->cheapest_total_path;

        add_path(grouped_rel, (Path *)
                 create_agg_path(root, grouped_rel, path,
                                 grouped_rel->reltarget,
                                 AGG_SORTED, AGGSPLIT_SIMPLE,
                                 parse->groupClause, havingQual,
                                 agg_costs, dNumGroups));
        return;
    }

You may also want to consider the cheapest startup path there too, so that the LIMIT processing can do something smarter later in planning (assuming cheapest_total_path != cheapest_startup_path, which you'd need to check for). Perhaps it would be better to only set GROUPING_INPUT_UNIQUE if there is a groupClause, then just Assert(parse->groupClause != NIL) inside that if. David
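For illustration only, here is a minimal sketch of the flag-setting side described above. The flag's bit value and the helper input_rel_is_unique_for_groupclause() are hypothetical stand-ins for whatever test the UniqueKey infrastructure provides; they are not existing core symbols:

    /* Hypothetical sketch: inside create_grouping_paths(), where 'flags' is built. */
    #define GROUPING_INPUT_UNIQUE   0x0010      /* illustrative bit value only */

    if (parse->groupClause != NIL && parse->hasAggs &&
        input_rel_is_unique_for_groupclause(root, input_rel, parse->groupClause))
    {
        /* One row per group: no sorting, hashing or partial aggregation needed. */
        flags |= GROUPING_INPUT_UNIQUE;
    }
    else
    {
        /* Existing logic: decide GROUPING_CAN_USE_SORT / _HASH / _PARTIAL_AGG. */
        if (grouping_is_sortable(parse->groupClause))
            flags |= GROUPING_CAN_USE_SORT;
        if (grouping_is_hashable(parse->groupClause))
            flags |= GROUPING_CAN_USE_HASH;
    }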
Re: Race condition in SyncRepGetSyncStandbysPriority
On Tue, Apr 14, 2020 at 04:32:40PM -0400, Tom Lane wrote: > I wrote: > > It doesn't seem to me to be that hard to implement the desired > > semantics for synchronous_standby_names with inconsistent info. > > In FIRST mode you basically just need to take the N smallest > > priorities you see in the array, but without assuming there are no > > duplicates or holes. It might be a good idea to include ties at the > > end, that is if you see 1,2,2,4 or 1,3,3,4 and you want 2 sync > > standbys, include the first three of them in the calculation until > > the inconsistency is resolved. In ANY mode I don't see that > > inconsistent priorities matter at all. > > Concretely, I think we ought to do the attached, or something pretty > close to it. > > I'm not really happy about breaking ties based on walsnd_index, > but I see that there are several TAP test cases that fail if we > do something else. I'm inclined to think those tests are bogus ... > but I won't argue to change them right now. This passes the test battery I wrote in preparation for the 2020-02 thread.
Re: Vacuum o/p with (full 1, parallel 0) option throwing an error
On Tue, Apr 14, 2020 at 8:55 AM Amit Kapila wrote: > > On Tue, Apr 14, 2020 at 7:52 AM Michael Paquier wrote: > > > > > -VACUUM (PARALLEL 1) tmp; -- disables parallel vacuum option > > +VACUUM (PARALLEL 1) tmp; -- parallel vacuum disabled for temp tables > > WARNING: disabling parallel option of vacuum on "tmp" --- cannot vacuum > > temporary tables in parallel > > +VACUUM (PARALLEL 0, FULL TRUE) tmp; -- can specify parallel disabled (even > > though that's implied by FULL) > > > > To fully close the gap in the tests, I would also add a test for > > (PARALLEL 1, FULL false) where FULL directly specified, even if that > > sounds like a nit. That's fine to test even on a temporary table. > > > > Okay, I will do this once we agree on the error message stuff. > I have changed one of the existing tests to test the option suggested by you. Additionally, I have changed the docs as per the suggestion from Sawada-san. I haven't changed the error message. Let me know if you have any more comments. -- With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com

v5-0001-Fix-the-usage-of-parallel-and-full-options-of-vac.patch
Re: Vacuum o/p with (full 1, parallel 0) option throwing an error
On Wed, Apr 15, 2020 at 08:54:17AM +0530, Amit Kapila wrote: > On Tue, Apr 14, 2020 at 8:55 AM Amit Kapila wrote: > > > > On Tue, Apr 14, 2020 at 7:52 AM Michael Paquier wrote: > > > > > > > > -VACUUM (PARALLEL 1) tmp; -- disables parallel vacuum option > > > +VACUUM (PARALLEL 1) tmp; -- parallel vacuum disabled for temp tables > > > WARNING: disabling parallel option of vacuum on "tmp" --- cannot > > > vacuum temporary tables in parallel > > > +VACUUM (PARALLEL 0, FULL TRUE) tmp; -- can specify parallel disabled > > > (even though that's implied by FULL) > > > > > > To fully close the gap in the tests, I would also add a test for > > > (PARALLEL 1, FULL false) where FULL directly specified, even if that > > > sounds like a nit. That's fine to test even on a temporary table. > > > > > > > Okay, I will do this once we agree on the error message stuff. > > > > I have changed one of the existing tests to test the option suggested > by you. Additionally, I have changed the docs as per suggestion from > Sawada-san. I haven't changed the error message. Let me know if you > have any more comments? You did: |...then the number of workers is determined based on the number of |indexes that support parallel vacuum operation on the [-relation,-]{+relation+} and is further |limited by . I'd suggest to say instead: |...then the number of workers is determined based on the number of |indexes ON THE RELATION that support parallel vacuum operation, and is further |limited by . -- Justin
Re: Vacuum o/p with (full 1, parallel 0) option throwing an error
On Wed, Apr 15, 2020 at 9:03 AM Justin Pryzby wrote: > > On Wed, Apr 15, 2020 at 08:54:17AM +0530, Amit Kapila wrote: > > On Tue, Apr 14, 2020 at 8:55 AM Amit Kapila wrote: > > > > > > On Tue, Apr 14, 2020 at 7:52 AM Michael Paquier > > > wrote: > > > > > > > > > > > -VACUUM (PARALLEL 1) tmp; -- disables parallel vacuum option > > > > +VACUUM (PARALLEL 1) tmp; -- parallel vacuum disabled for temp tables > > > > WARNING: disabling parallel option of vacuum on "tmp" --- cannot > > > > vacuum temporary tables in parallel > > > > +VACUUM (PARALLEL 0, FULL TRUE) tmp; -- can specify parallel disabled > > > > (even though that's implied by FULL) > > > > > > > > To fully close the gap in the tests, I would also add a test for > > > > (PARALLEL 1, FULL false) where FULL directly specified, even if that > > > > sounds like a nit. That's fine to test even on a temporary table. > > > > > > > > > > Okay, I will do this once we agree on the error message stuff. > > > > > > > I have changed one of the existing tests to test the option suggested > > by you. Additionally, I have changed the docs as per suggestion from > > Sawada-san. I haven't changed the error message. Let me know if you > > have any more comments? > > You did: > |...then the number of workers is determined based on the number of > |indexes that support parallel vacuum operation on the > [-relation,-]{+relation+} and is further > |limited by . > > I'd suggest to say instead: > |...then the number of workers is determined based on the number of > |indexes ON THE RELATION that support parallel vacuum operation, and is > further > |limited by . > I have not changed this now, but I find your suggestion better than the existing wording. I'll change this before committing the patch unless there are more comments. -- With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com
Re: Properly mark NULL returns in numeric aggregates
On Wed, 15 Apr 2020 at 03:41, Tom Lane wrote: > > David Rowley writes: > > For testing, can't we just have an Assert() in > > advance_transition_function that verifies isnull matches the > > nullability of the return value for INTERNAL returning transfns? i.e, > > the attached > > FTR, I do not like this Assert one bit. nodeAgg.c has NO business > inquiring into the contents of internal-type Datums. It has even > less business enforcing a particular Datum value for a SQL null --- > we have always, throughout the system, considered that if isnull > is true then the contents of the Datum are unspecified. I think > this is much more likely to cause problems than solve any.

OK. The latter case could be ignored by adding an OR condition to the Assert to allow isnull == false cases to pass without any consideration of the Datum value, but it sounds like you don't want to insist that isnull == true returns a NULL pointer.

FWIW, I agree with Jesse that having numeric_combine() return a NULL pointer without properly setting the isnull flag is pretty bad and it should be fixed regardless. Not fixing it, even in the absence of having a good way to test it, just seems like we're leaving something around that we're going to trip up on in the future. Serialization functions crashing after receiving input from a combine function seems pretty busted to me, regardless of whether there is a pathway for the functions to be called in that order in core or not. I'm not a fan of leaving it in just because testing for it might not be easy.

One problem with coming up with a way of testing this at the SQL level will be that we'll need to pick some aggregate functions that currently have this issue and ensure they don't regress. There's not much we can do to ensure any new aggregates we might create in the future don't go and break this rule. That's why I thought that the Assert might be more useful.

I don't think it would be impossible to test this using an extension and the create_upper_paths_hook. I see that test_rls_hooks, which runs during make check-world, does hook into the RLS hooks to test some behaviour. I don't think it would be too tricky to have a hook implement a 3-stage aggregate plan with the middle stage doing a deserial/combine/serial before passing to the Finalize Aggregate node. That would allow us to ensure serial functions can accept the results from combine functions, which nothing in core currently does. David
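For reference, a rough sketch of the kind of Assert being discussed, as it might sit just after the transition function call in nodeAgg.c's advance_transition_function(); this was never committed, and the field names here are approximate:

    newVal = FunctionCallInvoke(fcinfo);

    /*
     * Hypothetical check, not committed code: for a transition function
     * declared to return "internal", a NULL-pointer result Datum and
     * fcinfo->isnull must agree with each other.
     */
    Assert(pertrans->aggtranstype != INTERNALOID ||
           fcinfo->isnull == (DatumGetPointer(newVal) == NULL));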
Re: documenting the backup manifest file format
On 2020/04/14 2:40, Robert Haas wrote: On Fri, Mar 27, 2020 at 4:32 PM Andres Freund wrote: I don't like having a file format that's intended to be used by external tools too that's undocumented except for code that assembles it in a piecemeal fashion. Do you mean in a follow-on patch this release, or later? I don't have a problem with the former. Here is a patch for that.

While reading the document that you pushed, I thought that it's better to define an index term for the backup manifest, so that we can easily reach this document from the index page. Thought? Patch attached. Regards, -- Fujii Masao Advanced Computing Technology Center Research and Development Headquarters NTT DATA CORPORATION

diff --git a/doc/src/sgml/backup-manifest.sgml b/doc/src/sgml/backup-manifest.sgml
index 376aff0d6d..b9634f2706 100644
--- a/doc/src/sgml/backup-manifest.sgml
+++ b/doc/src/sgml/backup-manifest.sgml
@@ -3,6 +3,10 @@
 Backup Manifest Format
+
+  Backup Manifest
+
+
 The backup manifest generated by is primarily intended to permit the backup to be verified using
Re: Race condition in SyncRepGetSyncStandbysPriority
On Tue, 14 Apr 2020 at 18:35, Kyotaro Horiguchi wrote: > > At Tue, 14 Apr 2020 13:06:14 +0900, Masahiko Sawada > wrote in > > On Tue, 14 Apr 2020 at 10:34, Tom Lane wrote: > > > > > > Kyotaro Horiguchi writes: > > > > At Sat, 11 Apr 2020 18:30:30 -0400, Tom Lane wrote > > > > in > > > >> What I think we should do about this is, essentially, to get rid of > > > >> SyncRepGetSyncStandbys. Instead, let's have each walsender advertise > > > >> whether *it* believes that it is a sync standby, based on its last > > > >> evaluation of the relevant GUCs. This would be a bool that it'd > > > >> compute and set alongside sync_standby_priority. (Hm, maybe we'd not > > > > > > > Mmm.. SyncRepGetStandbyPriority returns the "priority" that a > > > > walsender thinks it is at, among synchronous_standby_names. Then to > > > > decide "I am a sync standby" we need to know how many walsenders with > > > > higher priority are alive now. SyncRepGetSyncStandbyPriority does the > > > > judgment now and suffers from the inconsistency of priority values. > > > > > > Yeah. After looking a bit closer, I think that the current definition > > > of sync_standby_priority (that is, as the result of local evaluation > > > of SyncRepGetStandbyPriority()) is OK. The problem is what we're doing > > > with it. I suggest that what we should do in SyncRepGetSyncRecPtr() > > > is make one sweep across the WalSnd array, collecting PID, > > > sync_standby_priority, *and* the WAL pointers from each valid entry. > > > Then examine that data and decide which WAL value we need, without > > > assuming > > > that the sync_standby_priority values are necessarily totally consistent. > > > But in any case we must examine each entry just once while holding its > > > mutex, not go back to it later expecting it to still be the same. SyncRepGetSyncStandbysPriority() is running while holding SyncRepLock, so sync_standby_priority of any walsender can be changed while the function is scanning walsenders. The issue is we already have inconsistent walsender information before we enter the function. Thus how many times we scan the array doesn't make any difference. > > I think we need to do one of the following. > > A) prevent SyncRepGetSyncStandbysPriority from being entered while > walsender priority is inconsistent. > > B) make SyncRepGetSyncStandbysPriority tolerant of priority > inconsistency. > > C) protect the walsender priority array from being inconsistent. > > (B) is a band-aid. To achieve (A) we need a central controller > of priority config handling. (C) is: > > > Can we have a similar approach of sync_standby_defined for > > sync_standby_priority? That is, checkpointer is responsible for > > changing sync_standby_priority of all walsenders when SIGHUP. That > > way, all walsenders can see a consistent view of > > sync_standby_priority. And when a walsender starts, it sets > > sync_standby_priority by itself. The logic to decide who's a sync > > standby doesn't change. SyncRepGetSyncRecPtr() gets all walsenders > > having higher priority along with their WAL positions. > > Yeah, it works if we do that, but the problem with that way is that to > determine the priority of walsenders, we need to know which walsenders are > running. That is, when a new walsender comes, the process needs to be aware > of the arrival (or leaving) right away and reassign the priority of > every walsender again. I think we don't need to reassign the priority when a new walsender comes or leaves. IIUC the priority is calculated based only on synchronous_standby_names.
A walsender coming or leaving doesn't affect others' priorities. Regards, -- Masahiko Sawada http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
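For context, an abridged sketch of SyncRepGetStandbyPriority() (paraphrasing syncrep.c, details trimmed) showing why this holds: a walsender derives its priority purely from its own application_name and synchronous_standby_names, so which other walsenders happen to be running never enters into the calculation:

    /* Abridged sketch of syncrep.c's SyncRepGetStandbyPriority(); not verbatim. */
    static int
    SyncRepGetStandbyPriority(void)
    {
        const char *standby_name;
        int         priority;

        if (am_cascading_walsender || SyncRepConfig == NULL)
            return 0;           /* never a synchronous standby */

        /* Walk the names listed in synchronous_standby_names. */
        standby_name = SyncRepConfig->member_names;
        for (priority = 1; priority <= SyncRepConfig->nmembers; priority++)
        {
            if (pg_strcasecmp(standby_name, application_name) == 0 ||
                strcmp(standby_name, "*") == 0)
                return (SyncRepConfig->syncrep_method == SYNC_REP_PRIORITY) ?
                    priority : 1;
            standby_name += strlen(standby_name) + 1;
        }

        return 0;               /* not listed: not a sync standby candidate */
    }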
Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions
On Tue, Apr 14, 2020 at 9:14 PM Erik Rijkers wrote: > > On 2020-04-14 12:10, Dilip Kumar wrote: > > > v14-0001-Immediately-WAL-log-assignments.patch + > > v14-0002-Issue-individual-invalidations-with.patch + > > v14-0003-Extend-the-output-plugin-API-with-stream-methods.patch+ > > v14-0004-Gracefully-handle-concurrent-aborts-of-uncommitt.patch+ > > v14-0005-Implement-streaming-mode-in-ReorderBuffer.patch + > > v14-0006-Add-support-for-streaming-to-built-in-replicatio.patch+ > > v14-0007-Track-statistics-for-streaming.patch + > > v14-0008-Enable-streaming-for-all-subscription-TAP-tests.patch + > > v14-0009-Add-TAP-test-for-streaming-vs.-DDL.patch + > > v14-0010-Bugfix-handling-of-incomplete-toast-tuple.patch > > applied on top of 8128b0c (a few hours ago) > > Hi, > > I haven't followed this thread and maybe this instability is > known/expected; just thought I'd let you know. > > When doing a pgbench run over logical replication (cascading > down two replicas), I get this segmentation fault. Thanks for the testing. Is it possible to share the call stack? > > 2020-04-14 17:27:28.135 CEST [8118] DETAIL: Streaming transactions > committing after 0/5FA2A38, reading WAL from 0/5FA2A00. > 2020-04-14 17:27:28.135 CEST [8118] LOG: logical decoding found > consistent point at 0/5FA2A00 > 2020-04-14 17:27:28.135 CEST [8118] DETAIL: There are no running > transactions. > 2020-04-14 17:27:28.138 CEST [8006] LOG: server process (PID 8118) was > terminated by signal 11: Segmentation fault > 2020-04-14 17:27:28.138 CEST [8006] DETAIL: Failed process was running: > COMMIT > 2020-04-14 17:27:28.138 CEST [8006] LOG: terminating any other active > server processes > 2020-04-14 17:27:28.138 CEST [8163] WARNING: terminating connection > because of crash of another server process > 2020-04-14 17:27:28.138 CEST [8163] DETAIL: The postmaster has > commanded this server process to roll back the current transaction and > exit, because another server process exited abnormally and possibly > corrupted shared memory. > 2020-04-14 17:27:28.138 CEST [8163] HINT: In a moment you should be > able to reconnect to the database and repeat your command. > > > This error happens somewhat buried away in my test-stuff; I can dig it > out and make it into a repeatable test if you need it. (debian > stretch/gcc 9.3.0) Yeah, that will be great. -- Regards, Dilip Kumar EnterpriseDB: http://www.enterprisedb.com