Re: [HACKERS] The plan for FDW-based sharding
On 02/27/2016 06:57 AM, Robert Haas wrote: On Sat, Feb 27, 2016 at 1:49 AM, Konstantin Knizhnik wrote: pg_tsdtm is based on another approach: it uses the system time as CSN and doesn't require an arbiter. In theory there is no limit on scalability. But differences in system time and the need for more rounds of communication have a negative impact on performance.

How do you prevent clock skew from causing serialization anomalies?

If a node receives a message from the "future", it just needs to wait until that future arrives. In practice we simply "adjust" the system time in this case, moving it forward (the system time is not actually changed; we just set a correction value which is added to the system time). This approach was discussed in the article: http://research.microsoft.com/en-us/people/samehe/clocksi.srds2013.pdf I hope the algorithm is explained there much better than I can do it here. A few notes:
1. I can not prove that our pg_tsdtm implements the approach described in this article absolutely correctly.
2. I did not try to formally prove that our implementation can not cause any serialization anomalies.
3. We just ran various synchronization tests (including the simplest debit-credit test, which breaks an old version of Postgres-XL) for several days and did not get any inconsistencies.
4. We have tested pg_tsdtm on a single node, on a blade cluster and on geographically distributed nodes (more than a thousand kilometers apart: one server in Vladivostok, another in Kaliningrad). Ping between these two servers takes about 100 msec. Performance of our benchmark drops about 100 times, but there were no inconsistencies.

Also, I want to point out once again that pg_tsdtm was not the primary idea of the proposed patch. There are well-known limitations of pg_tsdtm which we will try to address in the future. What we want is to include the XTM API in PostgreSQL so that we can continue our experiments with different transaction managers and implement multimaster on top of it (our first practical goal) without affecting the PostgreSQL core. If the XTM patch is included in 9.6, then we can propose our multimaster as a PostgreSQL extension and everybody can use it. Otherwise we have to maintain our own fork of Postgres, which significantly complicates using and maintaining it.

So there is no ideal solution which works well for all clusters. This is why it is not possible to develop just one GTM, propose it as a patch for review and then (hopefully) commit it into the Postgres core. IMHO it will never happen. And I do not think it is actually needed. What we need is a way to create our own transaction managers as Postgres extensions without affecting the core.

This seems rather defeatist. If the code is good and reliable, why should it not be committed to core?

Two reasons:
1. There is no ideal implementation of DTM which will fit all possible needs and be efficient for all clusters.
2. Even if such an implementation existed, the right way to integrate it is for Postgres to use some kind of TM API. I hope that everybody will agree that doing it this way:

    #ifdef PGXC
        /* In Postgres-XC, stop timestamp has to follow the timeline of GTM */
        xlrec.xact_time = xactStopTimestamp + GTMdeltaTimestamp;
    #else
        xlrec.xact_time = xactStopTimestamp;
    #endif

or this way:

    xlrec.xact_time = xactUseGTM ? xactStopTimestamp + GTMdeltaTimestamp : xactStopTimestamp;

is a very, very bad idea. In OO programming we would have an abstract TM interface and several implementations of this interface, for example MVCC_TM, 2PL_TM, Distributed_TM...
This is exactly what can be done with our XTM API. Since Postgres is implemented in C, not in C++, we have to emulate interfaces using structures with function pointers. And please notice that there is no need at all to include a DTM implementation in core, because it is not needed by everybody: it can easily be distributed as an extension. I hope that quite soon we can propose a multimaster extension which will provide functionality similar to MySQL Galera. And even right now we have integrated pg_dtm and pg_tsdtm with pg_shard and postgres_fdw, providing distributed consistency for them.

All arguments against XTM can be applied to any other extension API in Postgres, for example FDW. Is it general enough? There are many useful operations which are currently not handled by this API, for example performing aggregation and grouping on the foreign server side. But it is still a very useful and flexible mechanism, allowing many wonderful things to be implemented.

That is true. And everybody is entitled to an opinion on each new proposed hook, as to whether that hook is general or not. We have both accepted and rejected proposed hooks in the past.

-- 
Konstantin Knizhnik
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company
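To make the "structures with function pointers" idea concrete, here is a minimal, self-contained sketch in C. All names are hypothetical and chosen for illustration only; they are not the actual XTM definitions from the patch:

    /*
     * Hypothetical sketch of a pluggable transaction-manager interface in C,
     * emulated with a structure of function pointers (not the real XTM API).
     */
    #include <stdbool.h>
    #include <stdint.h>
    #include <stdio.h>

    typedef uint32_t TransactionId;

    typedef struct TransactionManager
    {
        TransactionId (*get_new_xid) (void);
        bool          (*xid_in_snapshot) (TransactionId xid);
        void          (*set_transaction_status) (TransactionId xid, int status);
    } TransactionManager;

    /* Built-in "local" implementation */
    static TransactionId next_xid = 100;

    static TransactionId local_get_new_xid(void) { return next_xid++; }
    static bool local_xid_in_snapshot(TransactionId xid) { return xid >= next_xid; }
    static void local_set_status(TransactionId xid, int status)
    {
        printf("xid %u -> status %d\n", (unsigned) xid, status);
    }

    static TransactionManager LocalTM = {
        local_get_new_xid,
        local_xid_in_snapshot,
        local_set_status
    };

    /* The core always calls through this pointer; an extension (a DTM)
     * can replace it at load time with its own implementation. */
    static TransactionManager *TM = &LocalTM;

    int main(void)
    {
        TransactionId xid = TM->get_new_xid();
        TM->set_transaction_status(xid, 1);   /* 1 = "committed" in this sketch */
        return 0;
    }

The point is simply that the core always calls through the pointer, so an extension loaded at startup can substitute its own (for example distributed) implementation without any #ifdef in the core code.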
Re: [HACKERS] Relation cache invalidation on replica
On 02/27/2016 04:16 AM, Simon Riggs wrote: On 27 February 2016 at 00:33, Simon Riggs <si...@2ndquadrant.com> wrote: On 27 February 2016 at 00:29, Andres Freund <and...@anarazel.de> wrote: On 2016-02-26 18:05:55 +0300, Konstantin Knizhnik wrote:

> The reason for the problem is that invalidation messages are not delivered to the replica after the end of a concurrent create index. Invalidation messages are included in the xlog as part of the transaction commit record. Concurrent index creation is split into three transactions, the last of which just performs an inplace update of the index tuple, marking it as valid and invalidating the cache. But since this transaction has no assigned XID, no commit record is created in the WAL and sent to the replicas. As a result, the replica doesn't receive these invalidation messages.

Ugh, that's a fairly ugly bug. Looking now.

If the above is true, then the proposed fix wouldn't work either. No point in sending a cache invalidation message on the standby if you haven't also written WAL, since the catalog re-read would just see the old row. heap_inplace_update() does write WAL, which blows away the starting premise. So I'm not seeing this as an extant bug in an open source version of PostgreSQL, in my current understanding.

The inplace update does create a WAL record, and this is why the index is marked as valid on the replica. But invalidation messages are sent only with the transaction commit record, and no such record is created in this case because there is no assigned XID. This is a real bug which was originally observed by one of our customers with several versions of Postgres (the last one they tried was 9.5.1). Then we reproduced it locally and determined the cause of the problem.

The repro scenario is very simple: you just need to create a large enough table (pgbench with scale factor 100 works well in my case) so that "create index concurrently" takes a substantial amount of time. If, while this statement is in progress, you execute on the replica some query which could use this index, the backend will cache the state of the relation without the index. Even after the index is actually constructed, it will never be used in this backend (but other backends on the replica will use it).

I am not sure about the best way of fixing the problem. I have not tested whether Andres's proposal:

    if (nrels != 0 || nmsgs != 0 || RelcacheInitFileInval)

actually fixes the problem. Assigning an XID in heap_inplace_update should definitely work. Is it better than forcing XID assignment in DefineIndex? I am not sure, because this problem seems to be related only to concurrent index creation (but maybe I am wrong). At least not all inplace updates cause catalog invalidation and so require XID assignment.

-- 
Simon Riggs http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

-- 
Konstantin Knizhnik
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company
Re: [HACKERS] The plan for FDW-based sharding
On 02/27/2016 06:54 AM, Robert Haas wrote: On Fri, Feb 26, 2016 at 10:56 PM, Konstantin Knizhnik wrote:

We do not have a formal proof that the proposed XTM is "general enough" to handle all possible transaction manager implementations. But there are two general ways of dealing with isolation: snapshot based and CSN based.

I don't believe that for a minute. For example, consider this article:

Well, I have to agree that I was not right to say that there are just two ways of providing distributed isolation. There is at least one more method: conservative locking. But it causes a huge number of extra network messages to be exchanged. Also, I mostly considered solutions compatible with the PostgreSQL MVCC model, and there are definitely other approaches, like preserving transaction commit order (as is done in Galera). Some of them can be implemented with XTM (preserving commit order), some can not (2PL). I have already admitted that XTM does not allow implementing ANY transaction manager. But we have considered several approaches to distributed transaction management described in articles about really working systems. Some of them are real production systems, such as SAP HANA; some are just prototypes, but working prototypes for which the authors have performed benchmarking and comparison with other approaches. The references you have mentioned are mostly theoretical descriptions of the problem. They are nice to know, but it is hard to build a concrete implementation based on these articles.

Briefly answering your other questions:

For example, consider a table with a million rows spread across any number of servers.

This is a sharding scenario; pg_tsdtm will work well in this case without requiring a lot of extra messages.

Now consider another workload where each transaction reads a row on one server, reads a row on another server, ...

It can be solved both with pg_dtm (central arbiter) and with pg_tsdtm (no arbiter). But actually your scenarios just prove once again that there can not be just one ideal distributed TM.

So maybe the goal for the GTM isn't to provide true serializability across the cluster but some lesser degree of transaction isolation. But then exactly which serialization anomalies are we trying to prevent, and why is it OK to prevent those and not others?

Absolutely agree. There is some theoretical discussion regarding CAP and different distributed levels of isolation. But in practice people want to solve their tasks. Most PostgreSQL users use the default isolation level, read committed, although there are a lot of "wonderful" anomalies with it. Serializable transactions in Oracle actually violate the fundamental serializability rule, and still Oracle is one of the most popular databases in the world... There was an isolation bug in Postgres-XL which didn't prevent commercial customers from using it... So I do not say that discussing all these theoretical questions, such as formally proven correctness of a distributed algorithm, is not needed. But I do not understand why it should prevent us from providing an extensible TM API. Yes, we can not do everything with it. But we can still implement many different approaches. I think that somehow proves that it is "general enough".

https://en.wikipedia.org/wiki/Global_serializability

I think the neutrality of that article is *very* debatable, but it certainly contradicts the idea that snapshots and CSNs are the only methods of achieving global serializability.
Or consider this lecture: http://hssl.cs.jhu.edu/~randal/416/lectures.old/ln5.2.pdf That's a great introduction to the problem we're trying to solve here, but again, snapshots are not mentioned, and CSNs certainly aren't mentioned. This write-up goes further, explaining three different methods for ensuring global serializability, none of which mention snapshots or CSNs: http://heaven.eee.metu.edu.tr/~vision/LectureNotes/EE442/Ee442ch7.html Actually, I think the second approach is basically a snapshot/CSN-type approach, but it doesn't use that terminology and the connection to what you are proposing is very unclear. I think you're approaching this problem from a viewpoint that is entirely too focused on the code that exists in PostgreSQL today. Lots of people have done lots of academic research on how to solve this problem, and you can't possibly say that CSNs and snapshots are the only solution to this problem unless you haven't read any of those papers. The articles above aren't exceptional in mentioning neither of the approaches that you are advocating - they are typical of the literature in this area. How can it be that the only solutions to this problem are ones that are totally different from the approaches that university professors who spend time doing research on concurrency have spent time exploring? I think we need to back up here and
Re: [HACKERS] The plan for FDW-based sharding
Neither pg_dtm nor pg_tsdtm supports the serializable isolation level. We implemented distributed snapshot isolation - the repeatable-read isolation level. We also do not support the read-committed isolation level now. We do not try to preserve transaction commit order on all nodes, but in principle it can be implemented using the XTM API: it allows redefining the function which actually sets the transaction status. pg_dtm performs 2PC here, and in principle it is possible to enforce commits in any particular order.

Concerning CSNs, maybe you are right and it is not correct to use this notion in this case. Actually there are many "CSNs" involved in a transaction commit. First of all, each transaction is assigned a local CSN (timestamp) when it is ready to commit. Then the CSNs of all nodes are exchanged and the maximal CSN is chosen. This maximum is written as the final transaction CSN and is used in visibility checks.

On 02/27/2016 01:48 AM, Kevin Grittner wrote: On Fri, Feb 26, 2016 at 2:19 PM, Konstantin Knizhnik wrote:

pg_tsdtm is based on another approach: it is using system time as CSN

Which brings up an interesting point: if we want logical replication to be free of serialization anomalies for those using serializable transactions, we need to support applying transactions in an order which may not be the same as commit order -- CSN (as such) would be the wrong thing. If serializable transaction 1 (T1) modifies a row and concurrent serializable transaction 2 (T2) reads the old version of the row, and modifies something based on that, T2 must be applied to a logical replica first even if T1 commits before it; otherwise the logical replica could see a state not consistent with business rules and which could not have been seen (due to SSI) on the source database. Any DTM API which does not support some mechanism to rearrange the order of transactions from commit order to some other order (based on, for example, read-write dependencies) is not complete. If it does support that, it gives us a way forward for presenting consistent data on logical replicas.

To avoid confusion, it might be best to reserve CSN for actual commit sequence numbers, or at least values which increase monotonically with each commit. The term of art for what I described above is "apparent order of execution", so maybe we want to use AOE or AOoE for the order we choose to use in a particular implementation. It doesn't seem to me to be outright inaccurate for cases where the system time on the various systems is used.

-- 
Kevin Grittner
EDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

-- 
Konstantin Knizhnik
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company
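For illustration, here is a minimal standalone sketch in C of the CSN-agreement step described above: each node proposes a local, timestamp-based CSN, and the maximum becomes the global commit CSN. This only illustrates the idea; it is not pg_tsdtm code, and the names are invented:

    /*
     * Sketch: each participant proposes a local timestamp-based CSN for the
     * committing transaction; the coordinator picks the maximum, and that
     * value is written as the final CSN on every node and used later in
     * visibility checks.  Illustrative only.
     */
    #include <stdint.h>
    #include <stdio.h>
    #include <time.h>

    typedef uint64_t csn_t;

    /* Local CSN proposal: current wall-clock time in microseconds. */
    static csn_t
    local_csn(void)
    {
        struct timespec ts;
        clock_gettime(CLOCK_REALTIME, &ts);
        return (csn_t) ts.tv_sec * 1000000 + ts.tv_nsec / 1000;
    }

    /* Coordinator: agree on the maximum of the proposed CSNs. */
    static csn_t
    agree_commit_csn(const csn_t *proposed, int nnodes)
    {
        csn_t max = 0;
        for (int i = 0; i < nnodes; i++)
            if (proposed[i] > max)
                max = proposed[i];
        return max;
    }

    int main(void)
    {
        csn_t proposals[3] = { local_csn(), local_csn() + 7, local_csn() };
        printf("global commit CSN = %llu\n",
               (unsigned long long) agree_commit_csn(proposals, 3));
        return 0;
    }

Roughly speaking, a tuple whose transaction got a commit CSN greater than the snapshot's CSN is then treated as invisible to that snapshot.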
Re: [HACKERS] The plan for FDW-based sharding
On 02/27/2016 11:38 PM, Kevin Grittner wrote: Is this an implementation of some particular formal technique? If so, do you have a reference to a paper on it? I get the sense that there has been a lot written about distributed transactions, and that it would be a mistake to ignore it, but I have not (yet) reviewed the literature for it. The reference to the article is at our WiKi pages explaining our DTM: https://wiki.postgresql.org/wiki/DTM http://research.microsoft.com/en-us/people/samehe/clocksi.srds2013.pdf -- Konstantin Knizhnik Postgres Professional: http://www.postgrespro.com The Russian Postgres Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] The plan for FDW-based sharding
Thank you very much for your comments.

On 01.03.2016 18:19, Robert Haas wrote: On Sat, Feb 27, 2016 at 2:29 AM, Konstantin Knizhnik wrote:

How do you prevent clock skew from causing serialization anomalies?

If a node receives a message from the "future", it just needs to wait until that future arrives. In practice we just "adjust" the system time in this case, moving it forward (the system time is not actually changed; we just set a correction value which is added to the system time). This approach was discussed in the article: http://research.microsoft.com/en-us/people/samehe/clocksi.srds2013.pdf I hope the algorithm is explained there much better than I can do it here.

Hmm, the approach in that article is very interesting, but it sounds different than what you are describing - they do not, AFAICT, have anything like a "correction value".

In the article they use the notion of a "wait":

    if T.SnapshotTime > GetClockTime()
        then wait until T.SnapshotTime

Originally we really did sleep here, but then we realized that instead of sleeping we can just adjust the local time. Sorry, I do not have a formal proof that it is equivalent, but... at least we have not encountered any inconsistencies after this change, and performance is improved. (See the sketch at the end of this message.)

There are well known limitations of this pg_tsdtm which we will try to address in the future.

How well known are those limitations? Are they documented somewhere? Or are they only well-known to you?

Sorry, well known to us. But they are described on the DTM wiki page. Right now pg_tsdtm does not support correct distributed deadlock detection (it does not build a global lock graph) and detects distributed deadlocks just based on timeouts. It doesn't support explicit locks, but "select for update" will work correctly.

What we want is to include the XTM API in PostgreSQL to be able to continue our experiments with different transaction managers and implement multimaster on top of it (our first practical goal) without affecting the PostgreSQL core. If the XTM patch is included in 9.6, then we can propose our multimaster as a PostgreSQL extension and everybody can use it. Otherwise we have to maintain our own fork of Postgres, which significantly complicates using and maintaining it.

Well I still think what I said before is valid. If the code is good, let it be a core submission. If it's not ready yet, submit it to core when it is. If it can't be made good, forget it.

I have nothing against committing the DTM code in core. But the best way to integrate it is still an OO-like approach, so we still need an API. Inserting if-s or switches in the existing code is IMHO an ugly idea. Also, it is not enough for the DTM code to be just "good": it should provide the expected functionality. But which functionality is expected? From my experience of developing different cluster solutions I can say that different customers have very different requirements. It is very hard, if possible at all, to satisfy them all. Right now I do not feel that I can predict all possible requirements for a DTM. This is why we want to provide an API, propose some implementations of this API, receive feedback and get a better understanding of which functionality is actually needed by customers.

This seems rather defeatist. If the code is good and reliable, why should it not be committed to core?

Two reasons: 1. There is no ideal implementation of DTM which will fit all possible needs and be efficient for all clusters.

Hmm, what is the reasoning behind that statement?
I mean, it is certainly true that there are some places where we have decided that one-size-fits-all is not the right approach. Indexing, for example. But there are many other places where we have not chosen to make things pluggable, and I don't think it should be taken for granted that pluggability is always an advantage. I fear that building a DTM that is fully reliable and also well-performing is going to be really hard, and I think it would be far better to have one such DTM that is 100% reliable than two or more implementations each of which are 99% reliable.

The question is not about its reliability, but mostly about its functionality and flexibility.

2. Even if such an implementation existed, the right way to integrate it is for Postgres to use some kind of TM API.

Sure, APIs are generally good, but that doesn't mean *this* API is good.

Well, I do not want to just say "better than nothing", but I find this API to be a reasonable compromise between flexibility and minimization of changes in the PostgreSQL core. If you have suggestions on how to improve it, I will be glad to receive them.

I hope that everybody will agree that doing it in this way:

    #ifdef PGXC
        /* In Postgres-XC, stop timestamp has to follow the timeline of GTM */
        xlrec.xact_time = xactStopTimestamp + GTMdeltaTimestamp;
    #else
        xlrec.xact_time = xactStopTimestamp;
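Here is the sketch referenced above: a minimal, self-contained illustration in C of the two strategies for a snapshot timestamp that lies in the local future - literally waiting versus advancing a correction offset. It only illustrates the idea and is not the pg_tsdtm code:

    /*
     * Two ways to handle a snapshot timestamp that is "in the future"
     * relative to the local clock: sleep until it arrives, or move a logical
     * clock forward by a correction offset.  Illustrative only.
     */
    #include <inttypes.h>
    #include <stdint.h>
    #include <stdio.h>
    #include <time.h>
    #include <unistd.h>

    static int64_t clock_correction_us = 0;   /* forward-only adjustment */

    static int64_t
    current_time_us(void)
    {
        struct timespec ts;
        clock_gettime(CLOCK_REALTIME, &ts);
        return (int64_t) ts.tv_sec * 1000000 + ts.tv_nsec / 1000 + clock_correction_us;
    }

    /* Option 1: literally wait until the remote snapshot time has arrived. */
    static void
    wait_until(int64_t snapshot_time_us)
    {
        int64_t now = current_time_us();
        if (snapshot_time_us > now)
            usleep((useconds_t) (snapshot_time_us - now));
    }

    /* Option 2: do not sleep; advance the logical clock so that local
     * timestamps are never behind what we have already observed. */
    static void
    advance_clock_to(int64_t snapshot_time_us)
    {
        int64_t now = current_time_us();
        if (snapshot_time_us > now)
            clock_correction_us += snapshot_time_us - now;
    }

    int main(void)
    {
        int64_t future = current_time_us() + 2000;   /* 2 ms ahead */
        wait_until(future);                          /* strategy from the paper */
        advance_clock_to(future + 2000);             /* strategy used instead */
        printf("corrected time: %" PRId64 " us\n", current_time_us());
        return 0;
    }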
Re: [HACKERS] The plan for FDW-based sharding
On 01.03.2016 19:03, Robert Haas wrote: On Tue, Mar 1, 2016 at 10:37 AM, Bruce Momjian wrote: On Tue, Mar 1, 2016 at 10:19:45AM -0500, Robert Haas wrote:

Two reasons: 1. There is no ideal implementation of DTM which will fit all possible needs and be efficient for all clusters.

Hmm, what is the reasoning behind that statement? I mean, it is certainly true that there are some places where we have decided that one-size-fits-all is not the right approach. Indexing, for example.

Uh, is that even true of indexing? While the plug-in nature of indexing allows for easier development and testing, does anyone create plug-in indexing that isn't shipped by us? I thought WAL support was something that prevented external indexing solutions from working.

True. There is an API, though, and having pluggable WAL support seems desirable too. At the same time, I don't think we know of anyone maintaining a non-core index AM ... and there are probably good reasons for that. We end up revising the index AM API pretty regularly every time somebody wants to do something new, so it's not really a stable API that extensions can just tap into. I suspect that a transaction manager API would end up similarly situated.

IMHO a non-stable API is better than no API at all, simply because it makes it possible to implement features in a modular way. And refactoring an API is not such a difficult thing...

-- 
Konstantin Knizhnik
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company
Re: [HACKERS] The plan for FDW-based sharding
On 03/01/2016 09:19 PM, Petr Jelinek wrote:

Since this thread heavily discusses the XTM, I have a question about the XTM as proposed, because one thing is very unclear to me - what happens when the user changes the XTM plugin on the server? I didn't see any xid handover API, which makes me wonder if changes of a plugin (or for example failure to load a previously used plugin due to admin error) will send the server into a situation similar to xid wraparound.

The transaction manager is a very "intimate" part of a DBMS, and certainly bugs and problems in a custom TM implementation can break the server. So if you provide a custom TM implementation, you have to take full responsibility for system integrity.

The XTM API itself doesn't enforce any XID handling policy. Since we do not want to change the tuple header format, the XID is still a 32-bit integer. In the case of pg_dtm, global transactions on all nodes are assigned the same XID by the arbiter, and the arbiter handles XID wraparound. In pg_tsdtm each node maintains its own XIDs; pg_tsdtm doesn't actually change the way Postgres assigns XIDs, so wraparound is handled in the standard way. Instead of assigning its own global XIDs, pg_tsdtm provides a mapping between local XIDs and global CSNs. The visibility checking rules look at CSNs, not at XIDs.

In both cases, if the system is restarted for some reason and the DTM plugin fails to load, you can still access the database locally. No data can be lost.

-- 
Konstantin Knizhnik
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company
Re: [HACKERS] PROPOSAL: Fast temporary tables
As far as I know, we are trying to kill two birds with one stone:
1. Reduce the overhead of accessing temporary tables.
2. Make it possible to create temporary tables on a replica.

Replicas with hot standby are widely used for running read-only OLAP queries. But such queries usually store intermediate results in temporary tables, and unfortunately creating a temporary table on a read-only replica is impossible now. So some customers resort to the following trick: they create a pool of file FDWs on the master and then use them on the replicas. But IMHO it is an ugly and inefficient hack.

Ideally we should be able to create temporary tables on a replica without affecting the system catalog. But there are a lot of problems: where they should be stored, how to assign XIDs to the tuples inserted into a temporary table, ... Unfortunately, it looks like there is no simple solution to the problem. The 100% solution is multimaster (which we are currently developing), but that is a completely different story...

On 03/01/2016 10:17 PM, Jim Nasby wrote: On 3/1/16 10:05 AM, Atri Sharma wrote:

Fair point, that means inventing a whole new OID generation structure..

Generation is just the tip of the iceberg. You still need the equivalent to foreign keys (ie: pg_depend). While you would never have a permanent object depend on a temp object, the reverse certainly needs to be supported.

If I were attempting to solve this at a SQL level, I'd be thinking about using table inheritance such that the permanent objects are stored in a permanent parent. New backends would create UNLOGGED children off of that parent. There would be a pid column that was always NULL in the parent, but populated in children. That means children could use their own local form of an OID. When a backend terminates you'd just truncate all its tables. Actually translating that into relcache and everything else would be a serious amount of work.

-- 
Konstantin Knizhnik
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company
Re: [HACKERS] The plan for FDW-based sharding
On 01.03.2016 22:02, Bruce Momjian wrote: On Tue, Mar 1, 2016 at 07:56:58PM +0100, Petr Jelinek wrote:

Note that I am not saying that other discussed approaches are any better, I am saying that we should know approximately what we actually want and not just beat FDWs with a hammer and hope sharding will eventually emerge and call that the plan.

I will say it again --- FDWs are the only sharding method I can think of that has a chance of being accepted into Postgres core. It is a plan, and if it fails, it fails. If it succeeds, that's good. What more do you want me to say? I know of no other way to answer the questions you asked above.

I do not understand why it should fail. The FDW approach may not be flexible enough for building optimal distributed query execution plans for complex OLAP queries. But for simple queries it should work fine, and simple queries correspond to OLTP and simple OLAP. For OLTP we definitely need a transaction manager to provide global consistency, and we actually have a prototype integrating postgres_fdw with our pg_dtm and pg_tsdtm transaction managers. The results are IMHO quite promising (see the attached diagram).

-- 
Konstantin Knizhnik
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company

DTM-pgconf.pdf
Description: Adobe PDF document
Re: [HACKERS] The plan for FDW-based sharding
scenarios. It is especially true for DTM, because the requirements of various cluster solutions are very different. And the most convenient way of doing it is to ship a DTM as an extension, not as some fork of Postgres; that will greatly simplify using it.

Now, on connection pooling, I am similarly not opposed to having some well-designed hooks, but I also think in the long run it would be better for some improvements in this area to be part of core. None of that means I would support any particular hook proposal, of course.

-- 
Konstantin Knizhnik
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company
Re: [HACKERS] Optimizer questions
On 03/08/2016 07:01 AM, Tom Lane wrote: Konstantin Knizhnik writes: Attached please find improved version of the optimizer patch for LIMIT clause. This patch isn't anywhere close to working after 3fc6e2d7f5b652b4. (TBH, the reason I was negative about this upthread is that I had that one in the oven and knew it would conflict spectacularly.) I encourage you to think about how an optimization of this sort could be made to work in a non-kluge fashion in the new code structure. I've not spent a lot of time on this, but I think maybe what would make sense is to consider both the case where function calculations are postponed to after ORDER BY and the case where they aren't, and generate Paths for both. Neither approach is a slam-dunk win. For example, suppose that one of the tlist columns is md5(wide_column) --- it will likely not be preferable to pass the wide column data through the sort step rather than reducing it to a hash first. This would require some work in grouping_planner to track two possible pathtargets, and work in create_ordered_paths to generate paths for both possibilities. A possible objection is that this would add planning work even when no real benefit is possible; so maybe we should only consider the new way if the tlist has significant eval cost? Not sure about that. There is also something to be said for the idea that we should try to guarantee consistent semantics when the tlist contains volatile functions. For now, I've set this commitfest entry to Waiting on Author. There's still time to consider a rewrite in this 'fest, if you can get it done in a week or two. regards, tom lane Attached please find rebased patch. Unfortunately 3fc6e2d7f5b652b4 still doesn't fix the problem with "lazy" evaluation of target list. This is why my patch is still useful. But frankly speaking I am not sure that it is best way of fixing this problem, because it takes in account only one case: sort+limit. May be the same optimization can be useful for other queries. -- Konstantin Knizhnik Postgres Professional: http://www.postgrespro.com The Russian Postgres Company diff --git a/src/backend/optimizer/plan/planner.c b/src/backend/optimizer/plan/planner.c index 5fc8e5b..709d1ad 100644 --- a/src/backend/optimizer/plan/planner.c +++ b/src/backend/optimizer/plan/planner.c @@ -126,8 +126,9 @@ static RelOptInfo *create_ordered_paths(PlannerInfo *root, RelOptInfo *input_rel, double limit_tuples); static PathTarget *make_scanjoin_target(PlannerInfo *root, List *tlist, - AttrNumber **groupColIdx); + AttrNumber **groupColIdx, bool* splitted_projection); static int get_grouping_column_index(Query *parse, TargetEntry *tle); +static int get_sort_column_index(Query *parse, TargetEntry *tle); static List *postprocess_setop_tlist(List *new_tlist, List *orig_tlist); static List *select_active_windows(PlannerInfo *root, WindowFuncLists *wflists); static List *make_windowInputTargetList(PlannerInfo *root, @@ -1381,6 +1382,7 @@ grouping_planner(PlannerInfo *root, bool inheritance_update, RelOptInfo *current_rel; RelOptInfo *final_rel; ListCell *lc; + bool splitted_projection = false; /* Tweak caller-supplied tuple_fraction if have LIMIT/OFFSET */ if (parse->limitCount || parse->limitOffset) @@ -1657,7 +1659,7 @@ grouping_planner(PlannerInfo *root, bool inheritance_update, * that were obtained within query_planner(). */ sub_target = make_scanjoin_target(root, tlist, - &groupColIdx); + &groupColIdx, &splitted_projection); /* * Forcibly apply that tlist to all the Paths for the scan/join rel. 
@@ -1801,6 +1803,13 @@ grouping_planner(PlannerInfo *root, bool inheritance_update, { Path *path = (Path *) lfirst(lc); + if (splitted_projection) + { + path = apply_projection_to_path(root, current_rel, + path, create_pathtarget(root, tlist)); + } + + /* * If there is a FOR [KEY] UPDATE/SHARE clause, add the LockRows node. * (Note: we intentionally test parse->rowMarks not root->rowMarks @@ -3775,15 +3784,17 @@ create_ordered_paths(PlannerInfo *root, static PathTarget * make_scanjoin_target(PlannerInfo *root, List *tlist, - AttrNumber **groupColIdx) + AttrNumber **groupColIdx, + bool* splitted_projection) { Query *parse = root->parse; - List *sub_tlist; - List *non_group_cols; + List *sub_tlist = NIL; + List *non_group_cols = NIL; List *non_group_vars; int numCols; *groupColIdx = NULL; + *splitted_projection = false; /* * If we're not grouping or aggregating or windowing, there's nothing to @@ -3791,14 +3802,66 @@ make_scanjoin_target(PlannerInfo *root, */ if (!parse->hasAggs && !parse->groupClause && !parse->groupingSets && !root->hasHavingQual && !parse-
Re: [HACKERS] Optimizer questions
On 09.03.2016 09:15, Tom Lane wrote: I wrote:

BTW, there's some additional refactoring I had had in mind to do in grouping_planner to make its handling of the targetlist a bit more organized; in particular, I'd like to see it using PathTarget representation more consistently throughout the post-scan-join steps.

See 51c0f63e4d76a86b44e87876a6addcfffb01ec28 --- I think this gets things to where we could plug in additional levels of targets without too much complication.

regards, tom lane

So, if I understand you correctly, there are two major concerns:

1. Volatile functions. I wonder if it is really required to re-evaluate a volatile function for each record even if a LIMIT clause is present. The documentation says: "A query using a volatile function will re-evaluate the function at every row where its value is needed." So if we are using sort with limit and the value of the function is not used in the sort, then it is correct to say that the value of this function is not needed, so there is no need to re-evaluate it, isn't it?

2. Narrowing functions, like md5. Here I do not have any good idea how to support them. It looks like the cost of the SORT should depend on the tuple width; only in that case can the optimizer determine whether it is more efficient to evaluate the function earlier or to postpone its execution.

I think that the best approach is to generate two different paths: the original one, where projection is always done before the sort, and another one with postponed projection of non-trivial columns. Then we compare the costs of the two paths and choose the better one. Unfortunately, I do not understand yet how to implement this with the existing grouping_planner. Do you think that it is possible?

An alternative approach is to do something like my proposed patch, but take into account the cost of function execution and check for the presence of volatile/narrowing functions. This approach provides better flexibility, because we can choose the subset of columns not used in the sort whose evaluation should be postponed. But here we once again make a local decision while constructing the path instead of comparing the costs of full paths.

-- 
Konstantin Knizhnik
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company
Re: [HACKERS] Optimizer questions
On Mar 10, 2016, at 1:56 AM, Tom Lane wrote:
> Konstantin Knizhnik writes:
>> I think that the best approach is to generate two different paths: original one, when projection is always done before sort and another one with postponed projection of non-trivial columns. Then we compare costs of two paths and choose the best one. Unfortunately, I do not understand now how to implement it with existed grouping_planner. Do you think that it is possible?
>
> After fooling with this awhile, I don't think it's actually necessary to do that. See attached proof-of-concept patch.

Oh, you did all my work :)

But right now the cost estimation rule makes it impossible to apply this optimization to simple expressions like this:

postgres=# create table a (b integer);
CREATE TABLE
postgres=# insert into a values (generate_series(1, 10));
INSERT 0 10
postgres=# select b+b from a order by b limit 1;
NOTICE:  int4pl
NOTICE:  int4pl
NOTICE:  int4pl
NOTICE:  int4pl
NOTICE:  int4pl
NOTICE:  int4pl
NOTICE:  int4pl
NOTICE:  int4pl
NOTICE:  int4pl
NOTICE:  int4pl
 ?column?
----------
        2
(1 row)

postgres=# create or replace function twice(x integer) returns integer as $$ begin raise notice 'exec function' ; return x + x ; end $$ language plpgsql;
CREATE FUNCTION
postgres=# select twice(b) from a order by b limit 1;
NOTICE:  exec function
NOTICE:  int4pl
 twice
-------
     2
(1 row)

I wonder whether there is any advantage in earlier evaluation of such simple expressions if they are not needed for the sort? It seems to be only true for narrowing functions, like md5... But that is quite an exotic case, isn't it? Maybe it is reasonable to be more optimistic in estimating the cost of postponed expression evaluation?

Also, I do not completely understand your concern about window functions. Is there any example of a query with window or aggregate functions where this optimization (postponing expression evaluation) can be applied? It would also be interesting to know whether there are other cases (besides SORT+LIMIT) where delaying projection leads to a more efficient plan.

> Although this patch gets through our regression tests, that's only because it's conservative about deciding to postpone evaluation; if it tried to postpone evaluation in a query with window functions, it would fail miserably, because pull_var_clause doesn't know about window functions. I think that that's a clear oversight and we should extend it to offer the same sorts of behaviors as it does for Aggrefs. But that would be slightly invasive, there being a dozen or so callers; so I didn't bother to do it yet pending comments on this patch.
>
> I think it's probably also broken for SRFs in the tlist; we need to work out what semantics we want for those. If we postpone any SRF to after the Sort, we can no longer assume that a query LIMIT enables use of bounded sort (because the SRF might repeatedly return zero rows). I don't have a huge problem with that, but I think now would be a good time to nail down some semantics.
>
> Some other work that would be useful would be to refactor so that the cost_qual_eval operations aren't so redundant. But that's just a small time savings not a question of functionality.
>
> And we'd have to document that this changes the behavior for volatile functions. For the better, I think: this will mean that you get consistent results no matter whether the query is implemented by indexscan or seqscan-and-sort, which has never been true before. But it is a change.
> > Do people approve of this sort of change in general, or this patch > approach in particular? Want to bikeshed the specific > when-to-postpone-eval policies implemented here? Other comments? > > regards, tom lane > > diff --git a/src/backend/optimizer/plan/planner.c > b/src/backend/optimizer/plan/planner.c > index 8937e71..b15fca1 100644 > *** a/src/backend/optimizer/plan/planner.c > --- b/src/backend/optimizer/plan/planner.c > *** static RelOptInfo *create_distinct_paths > *** 126,131 > --- 126,132 > RelOptInfo *input_rel); > static RelOptInfo *create_ordered_paths(PlannerInfo *root, >RelOptInfo *input_rel, > + PathTarget *target, >double limit_tuples); > static PathTarget *make_group_input_target(PlannerInfo *root, List *tlist); > static List *postprocess_setop_tlist(List *new_tlist, List *orig_tlist); > *** static PathTarget *make_window_input_tar > *** 13
Re: [HACKERS] eXtensible Transaction Manager API (v2)
ral arbiter, so we have to introduce an "in-doubt" state of a transaction, when it is not known whether the transaction is committed or aborted, and any other transaction accessing tuples updated by this transaction has to wait while its status is "in-doubt". The main challenge of pg_tsdtm is to make this period as short as possible... But these are details of a particular implementation which IMHO have no relation to the API itself.

3. Uh, how can you hook GetNewTransactionId but not ReadNewTransactionId?

Uh-uh-uh :) ReadNewTransactionId just reads the value of ShmemVariableCache->nextXid, but unfortunately that is not the only place where nextXid is used - there are about a hundred occurrences of nextXid in the Postgres core. This is why we made the decision that GetNewTransactionId should actually update ShmemVariableCache->nextXid, so that there is no need to rewrite all this code. Sorry, but IMHO this is a problem of the Postgres design and not of XTM ;) We just want to find some compromise which allows XTM to be flexible enough but minimizes changes in the core code.

4. There seems to be an intention to encapsulate snapshots, but surely wrapping hooks around GetSnapshotData and XidInMVCCSnapshot is not nearly enough for that. Look at all the knowledge snapmgr.c has about snapshot representation, for example. And is a function like GetOldestXmin even meaningful with a different notion of what snapshots are? (For that matter, is TransactionId == uint32 still tenable for any other notion of snapshots?)

The XTM encapsulation of snapshots allows us to implement pg_dtm. It does almost the same as the Postgres-XL GTM, but without a huge amount of #ifdefs. The representation of XIDs is yet another compromise point: we do not want to change the tuple header format, so an XID is still 32 bits and has the same meaning as in PostgreSQL. If a custom TM implementation wants to use some other identifiers of transactions, like CSNs in pg_tsdtm, it has to provide a mapping between them and XIDs.

5. BTW, why would you hook at XidInMVCCSnapshot rather than making use of the existing capability to have a custom SnapshotSatisfiesFunc snapshot checker function?

The HeapTupleSatisfies* routines in utils/time/tqual.c implement a lot of logic for handling different kinds of snapshots, checking/setting hint bits in tuples, caching, ... We do not want to replace or copy all this code in a DTM implementation. And XidInMVCCSnapshot is the common function finally used by most HeapTupleSatisfies* functions once all the other checks have passed. So it is really the most convenient place to plug in custom visibility checking rules. And as far as I remember, a similar approach was used in Postgres-XL.

IMO this is not committable as-is, and I don't think that it's something that will become committable during this 'fest. I think we'd be well advised to boot it to the 2016-09 CF and focus our efforts on other stuff that has a better chance of getting finished this month.

regards, tom lane

-- 
Konstantin Knizhnik
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company
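As an illustration of the XID-to-CSN mapping idea mentioned above, here is a small, self-contained sketch in C of what a CSN-based visibility check looks like conceptually. The names and the toy lookup table are hypothetical; this is not the pg_tsdtm code:

    /*
     * Conceptual sketch of CSN-based visibility: map a 32-bit XID to the CSN
     * assigned at commit and compare it with the snapshot's CSN, instead of
     * searching an array of in-progress XIDs.  Illustrative only.
     */
    #include <stdbool.h>
    #include <stdint.h>
    #include <stdio.h>

    typedef uint32_t TransactionId;
    typedef uint64_t csn_t;

    #define INVALID_CSN ((csn_t) 0)        /* transaction not committed yet */
    #define MAX_XID     1024               /* toy table size for the sketch */

    static csn_t commit_csn[MAX_XID];      /* hypothetical XID -> CSN map */

    typedef struct
    {
        csn_t snapshot_csn;                /* "as of" point of the snapshot */
    } CsnSnapshot;

    /* True if xid must be treated as still in progress for this snapshot. */
    static bool
    xid_in_csn_snapshot(TransactionId xid, const CsnSnapshot *snap)
    {
        csn_t csn = commit_csn[xid % MAX_XID];

        /* Invisible if not committed, or committed after the snapshot. */
        return csn == INVALID_CSN || csn > snap->snapshot_csn;
    }

    int main(void)
    {
        CsnSnapshot snap = { 100 };

        commit_csn[42] = 90;               /* committed before the snapshot */
        commit_csn[43] = 110;              /* committed after the snapshot  */

        printf("xid 42 in progress? %d\n", xid_in_csn_snapshot(42, &snap));  /* 0 */
        printf("xid 43 in progress? %d\n", xid_in_csn_snapshot(43, &snap));  /* 1 */
        return 0;
    }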
Re: [HACKERS] Batch update of indexes
Hi David, Rebased patch is attached. On 14.03.2016 15:09, David Steele wrote: Hi Konstantin, On 2/3/16 11:47 AM, Konstantin Knizhnik wrote: Attached please find patch for "ALTER INDEX ... WHERE ..." clause. It is now able to handle all three possible situations: 1. Making index partial (add WHERE condition to the ordinary index) 2. Extend partial index range (less restricted index predicate) 3. Arbitrary change of partial index predicate In case 2) new records are added to the index. In other two cases index is completely reconstructed. This patch includes src/bin/insbench utility for testing insert performance. It can be easily excluded from the patch to reduce it size. Also it is better to apply this patch together with "index-only scans with partial indexes" patch: This patch no longer applies on master: $ git apply ../other/alter-index.patch ../other/alter-index.patch:27: trailing whitespace. static void ../other/alter-index.patch:62: indent with spaces. SPIPlanPtr plan; ../other/alter-index.patch:63: indent with spaces. Portal portal; ../other/alter-index.patch:90: trailing whitespace. ../other/alter-index.patch:99: trailing whitespace. error: patch failed: src/test/regress/expected/aggregates.out:831 error: src/test/regress/expected/aggregates.out: patch does not apply Please provide a new patch for review. Meanwhile I am marking this "waiting on author". Thanks, -- Konstantin Knizhnik Postgres Professional: http://www.postgrespro.com The Russian Postgres Company diff --git a/doc/src/sgml/ref/alter_index.sgml b/doc/src/sgml/ref/alter_index.sgml index ee3e3de..ab042ed 100644 --- a/doc/src/sgml/ref/alter_index.sgml +++ b/doc/src/sgml/ref/alter_index.sgml @@ -27,6 +27,7 @@ ALTER INDEX [ IF EXISTS ] name SET ALTER INDEX [ IF EXISTS ] name RESET ( storage_parameter [, ... ] ) ALTER INDEX ALL IN TABLESPACE name [ OWNED BY role_name [, ... ] ] SET TABLESPACE new_tablespace [ NOWAIT ] +ALTER INDEX name WHERE predicate @@ -109,6 +110,18 @@ ALTER INDEX ALL IN TABLESPACE name + +WHERE predicate + + + Add or change predicate of partial index. Extending partial index predicate allows to implement batch update of index and so + increase insert speed. New records (not matching index predicate) can be added to the table at maximal speed without affecting indexes. + Later, in background, indexes can be refreshed using ALTER INDEX ... WHERE ... clause. + See for more discussion. 
+ + + + diff --git a/src/backend/commands/indexcmds.c b/src/backend/commands/indexcmds.c index 13b04e6..a63de2a 100644 --- a/src/backend/commands/indexcmds.c +++ b/src/backend/commands/indexcmds.c @@ -34,6 +34,7 @@ #include "commands/tablespace.h" #include "mb/pg_wchar.h" #include "miscadmin.h" +#include "funcapi.h" #include "nodes/nodeFuncs.h" #include "optimizer/clauses.h" #include "optimizer/planner.h" @@ -52,6 +53,9 @@ #include "utils/snapmgr.h" #include "utils/syscache.h" #include "utils/tqual.h" +#include "utils/ruleutils.h" +#include "executor/executor.h" +#include "executor/spi.h" /* non-export function prototypes */ @@ -280,6 +284,160 @@ CheckIndexCompatible(Oid oldId, return ret; } +static void +UpdateIndex(Oid indexRelationId, Node* whereClause) +{ + Datum values[Natts_pg_index]; + bool isnull[Natts_pg_index]; + HeapTuple oldTuple; + HeapTuple newTuple; + Relation pg_index; + + pg_index = heap_open(IndexRelationId, RowExclusiveLock); + oldTuple = SearchSysCacheCopy1(INDEXRELID, ObjectIdGetDatum(indexRelationId)); + if (!HeapTupleIsValid(oldTuple)) + elog(ERROR, "cache lookup failed for index %u", indexRelationId); + + heap_deform_tuple(oldTuple, RelationGetDescr(pg_index), values, isnull); + values[Anum_pg_index_indpred - 1] = CStringGetTextDatum(nodeToString(whereClause)); + isnull[Anum_pg_index_indpred - 1] = false; + newTuple = heap_form_tuple(RelationGetDescr(pg_index), values, isnull); + simple_heap_update(pg_index, &oldTuple->t_self, newTuple); + CatalogUpdateIndexes(pg_index, newTuple); + heap_freetuple(newTuple); + heap_freetuple(oldTuple); + heap_close(pg_index, NoLock); +} + +void +AlterIndex(Oid indexRelationId, IndexStmt *stmt) +{ + char* select; + Oid heapRelationId; + IndexUniqueCheck checkUnique; + Datum values[INDEX_MAX_KEYS]; + bool isnull[INDEX_MAX_KEYS]; + Relation heapRelation; + Relation indexRelation; +SPIPlanPtr plan; +Portal portal; + HeapTuple tuple; + TupleTableSlot *slot; + ItemPointer tupleid; + IndexInfo *indexInfo; + EState *estate; + Oid namespaceId; + List* deparseCtx; + char* oldIndexPredicate; + char* newIndexPredicate; + char* relationName; + + Assert(stmt->whereClause); + CheckPredicate((Expr *) stmt->whereClause); + + /* Open and lock the
[HACKERS] Applying logical replication changes by more than one process
Hi,

I am trying to use the logical replication mechanism in the implementation of PostgreSQL multimaster and have run into one conceptual problem. Originally logical replication was intended to support asynchronous replication, in which case applying changes in a single process should not be a bottleneck. But if we are using a distributed transaction manager to provide global consistency, then applying transactions with one worker leads to very bad performance and, what is worse, causes unintended serialization of transactions, which is not taken into account by the distributed deadlock detection algorithm and so can cause undetected deadlocks. So I have implemented a pool of background workers which can apply transactions concurrently. It works and shows acceptable performance.

But now I am thinking about HA and about tracking origin LSNs, which are needed to correctly specify the slot position in case of recovery. And there is a problem: as far as I understand, to correctly record the origin LSN in the WAL and advance the slot position it is necessary to set up the session using replorigin_session_setup. That is not so convenient when using a pool of background workers, because we have to set up the session for each commit. But the main problem is that each slot's session can be associated with only one process:

    else if (curstate->acquired_by != 0)
    {
        ereport(ERROR,
                (errcode(ERRCODE_OBJECT_IN_USE),
                 errmsg("replication identifier %d is already active for PID %d",
                        curstate->roident, curstate->acquired_by)));
    }

which once again means that there can be only one process applying changes.

To provide a correct state of the replication node it is necessary to ensure that each logical replication record is replayed exactly once: we should not lose a change or try to apply it twice. So the operation of recording the origin LSN position in the WAL and adjusting the slot should be atomic. And during recovery we should restore the slot's current position based on the origin values extracted from the WAL. I wonder whether this can be done using the current logical replication mechanism when the changes of each slot are applied by more than one process? Or is the only alternative to write/read origin LSNs in the WAL myself, for example using custom WAL records?

Thanks in advance!

-- 
Konstantin Knizhnik
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company
Re: [HACKERS] Applying logical replication changes by more than one process
On 21.03.2016 15:10, Petr Jelinek wrote:

Hi,

On 19/03/16 11:46, Konstantin Knizhnik wrote:

[...] But the main problem is that each slot's session can be associated with only one process:

    else if (curstate->acquired_by != 0)
    {
        ereport(ERROR,
                (errcode(ERRCODE_OBJECT_IN_USE),
                 errmsg("replication identifier %d is already active for PID %d",
                        curstate->roident, curstate->acquired_by)));
    }

which once again means that there can be only one process applying changes.

That's not true; all it means is that you can do replorigin_session_setup for the same origin only in one process, but you don't need to have it set up for the session to update it - replorigin_advance() works just fine.

But RecordTransactionCommit is using replorigin_session_advance, not replorigin_advance. And replorigin_session_advance requires that the session was set up:

    void
    replorigin_session_advance(XLogRecPtr remote_commit, XLogRecPtr local_commit)
    {
        Assert(session_replication_state != NULL);
        ...
    }

session_replication_state is a private variable which is set by replorigin_session_setup, but an attempt to call replorigin_session_setup from multiple processes causes the above error.

-- 
Konstantin Knizhnik
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company
Re: [HACKERS] Applying logical replication changes by more than one process
On Mar 21, 2016, at 4:30 PM, Petr Jelinek wrote:
> On 21/03/16 14:25, Andres Freund wrote:
>> On 2016-03-21 14:18:27 +0100, Petr Jelinek wrote:
>>> On 21/03/16 14:15, Andres Freund wrote:
>>>>> Only when the origin is actually setup for the current session. You need to call the replorigin_advance yourself from your apply code.
>>>>
>>>> That's problematic from a durability POV.
>>>
>>> Huh? How come?
>>
>> If you use the session mechanism the replication progress is synced with the apply process, even if there are crashes. Crash recovery updates the progress. There's no such interlock with apply otherwise, and I don't see how you can build one with reasonable effort.
>
> Ah you mean because with wal_log=true the origin advance is in a different WAL record than commit? OK yeah you might be one transaction behind then, true.

It actually means that we can not enforce database consistency. If we do replorigin_advance before the commit and a crash happens, then we will lose some changes. If we call replorigin_advance after the commit but a crash happens before it, then some changes can be applied multiple times - for example, we can insert some record twice (if there are no unique constraints). It looks like the only working scenario is to set up a replication session for each commit and use locking to prevent concurrent session setup for the same slot by multiple processes, doesn't it? I have tried it and, fortunately, it doesn't cause any noticeable performance degradation, but I can't consider such an approach elegant. (A sketch of this per-commit setup is at the end of this message.) Why is it actually necessary to bind a replication slot to a process? Why is it not possible to have multiple concurrent sessions for the same slot?

I am also concerned about the use of sequential search for slot location in replorigin_session_setup and many other functions - there is a loop through all max_replication_slots. It is not a problem when the number of slots is less than 10. For multimaster this assumption is true - even Oracle RAC rarely has a two-digit number of nodes. But if we want to perform sharding and use logical replication to provide redundancy, then the number of nodes and slots can be substantially larger. I haven't thought much about such a configuration - maybe it is possible to propose a more efficient replication mechanism for this case.

> 
> -- 
> Petr Jelinek                  http://www.2ndQuadrant.com/
> PostgreSQL Development, 24x7 Support, Training & Services
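Here is the sketch mentioned above of the per-commit origin session setup from a pool worker. It is only an outline under stated assumptions: apply_changes() and MyOriginLock are hypothetical placeholders, headers and error handling are omitted, and this is not how our multimaster actually implements it:

    /*
     * Outline of applying one remote transaction from a pool worker while
     * sharing a single replication origin between workers: the origin session
     * is acquired only around the commit, serialized by a (hypothetical) lock.
     * apply_changes() is a placeholder for replaying the decoded changes.
     */
    static void
    apply_remote_xact(RepOriginId origin_id,
                      XLogRecPtr remote_commit_lsn,
                      TimestampTz remote_commit_time,
                      List *changes)
    {
        StartTransactionCommand();

        apply_changes(changes);                 /* hypothetical helper */

        /* Serialize origin session setup across pool workers. */
        LWLockAcquire(MyOriginLock, LW_EXCLUSIVE);

        replorigin_session_setup(origin_id);
        replorigin_session_origin = origin_id;
        replorigin_session_origin_lsn = remote_commit_lsn;
        replorigin_session_origin_timestamp = remote_commit_time;

        /* The commit record now carries the origin LSN, and the slot is
         * advanced together with the commit by RecordTransactionCommit(). */
        CommitTransactionCommand();

        replorigin_session_reset();
        replorigin_session_origin = InvalidRepOriginId;

        LWLockRelease(MyOriginLock);
    }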
Re: [HACKERS] Applying logical replication changes by more than one process
On Mar 22, 2016, at 10:10 AM, Craig Ringer wrote:
> On 22 March 2016 at 14:32, konstantin knizhnik wrote:
>
>> Ah you mean because with wal_log=true the origin advance is in a different WAL record than commit? OK yeah you might be one transaction behind then, true.
>
> It actually means that we can not enforce database consistency. If we do replorigin_advance before commit and then a crash happens, we will lose some changes. If we call replorigin_advance after commit but a crash happens before it, then some changes can be applied multiple times. For example we can insert some record twice (if there are no unique constraints). It looks like the only working scenario is to set up a replication session for each commit and use locking to prevent concurrent session setup for the same slot by multiple processes, doesn't it?
>
> Yes.
>
> How would you expect it to work if you attempted to replorigin_advance without a session? From multiple concurrent backends?

It would not work. But I wonder why I would need to call replorigin_advance without a session. Please excuse me, I am not thinking about the general case of using logical replication now; I am just focused on multimaster. What I need is some efficient, durable and atomic mechanism for applying changes. I see only two ways to provide atomicity:
1. Tracking of origins is done inside the xact as part of the normal commit.
2. Use custom WAL records.
1) is easier now, and it really works if I correctly synchronize access to the slots. And surprisingly it doesn't even add substantial overhead.

> Parallel apply is complicated business. You have to make sure you apply xacts in an order that's free from deadlocks and from insert/delete anomalies - though you can at least detect those, ERROR that xact and all subsequent ones, and retry.

Well, this is exactly what our multimaster does. We do not try to enforce the order of applying xacts. But we detect global deadlocks and use 2PC to provide data consistency. So it is not a task of logical replication; it is done by the DTM, which overrides visibility checks and the transaction commit protocol using XTM.

> For progress tracking to be consistent and correct you'd have to make sure you committed strictly in the same order as upstream. Just before each commit you can set the origin LSN and advance the replication origin, which will commit atomically along with the commit it confirms. I don't really see the problem.

Sorry, I do not completely understand you. What do you mean by "will commit atomically along with the commit it confirms"? How will this atomicity be enforced?

> I have tried it, fortunately it doesn't cause any noticeable performance degradation. But unfortunately I can't consider such an approach elegant. Why is it actually necessary to bind a replication slot to a process? Why is it not possible to have multiple concurrent sessions for the same slot?
>
> Especially since most slot changes are LWLock- and/or spinlock-protected already.
>
> The client would have to manage replay confirmations appropriately so that it doesn't confirm past the point where some other connection still needs it.
>
> We'd have to expose a "slot" column in pg_stat_replication and remove the "pid" column from pg_replication_slots to handle the 1:n relationship between slot clients and slots, and it'd be a pain to show which normal user backends were using a slot. Not really sure how to handle that.
>
> To actually make this useful would require a lot more though.
A way to > request that replay start from a new LSN without a full disconnect/reconnect > each time. Client-side parallel consume/apply. Inter-transaction ordering > information so the client can work out a viable xact apply order (possibly > using SSI information per the discussion with Kevin?). Etc. > > I haven't really looked into this and I suspect there are some hairy areas > involved in replaying a slot from more than one client. The reason I'm > interested in it personally is for initial replica state setup as Oleksandr > prototyped and described earlier. We could attach to the slot's initial > snapshot then issue a new replication command that, given a table name or > oid, scans the table from the snapshot and passes each tuple to a new > callback (like, but not the same as, the insert callback) on the output > plugin. > > That way clients could parallel-copy the initial state of the DB across the > same replication protocol they then consume new changes from, with no need to > make normal libpq connections and COPY initial state. > > I'm interested in being able to do parallel receive of ne
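For reference, here is a minimal SQL-level sketch of option 1 mentioned earlier (recording origin progress inside the applying transaction), assuming the built-in replication origin functions; a real apply worker would use the equivalent C calls (replorigin_session_setup and friends). The node name and LSN are illustrative only:

SELECT pg_replication_origin_create('mm_node_2');        -- once per peer node
SELECT pg_replication_origin_session_setup('mm_node_2'); -- once per apply process
BEGIN;
-- ... apply the decoded changes here ...
SELECT pg_replication_origin_xact_setup('0/12345678', now());
COMMIT;  -- origin progress is written together with the commit record

Because the origin LSN is attached to the transaction itself, a crash either loses both the applied changes and the progress update or neither, which is exactly the atomicity discussed above.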
Re: [HACKERS] Applying logical replication changes by more than one process
On Mar 22, 2016, at 11:14 AM, Petr Jelinek wrote: > > And each slot means connection with logical decoding attached to it so you > don't really want to have thousands of those anyway. I think you'll hit other > problems faster than loop over slots becomes problem if you plan to keep all > of them active. Assume that a cluster has thousands of nodes and we use sharding to scatter data across the cluster nodes. But to provide HA we want to perform sharding with some level of redundancy, for example saving the same record at 3 different nodes. One possible approach (pg_shard) is to execute the same query at three different shards. But there is no guarantee that the result of execution will be the same at all nodes. An alternative approach is to execute the transaction at one node and then replicate it to the replicas using logical replication. So we do not perform logical replication to all 1000 nodes, just to 2 of them. But each time it will be a different pair of nodes, so we still need to have 1000 active replication slots. Maybe logical replication cannot be used at all in such a scenario - I have not thought much about it yet. Our first step will be multimaster without sharding. > > -- > Petr Jelinek http://www.2ndQuadrant.com/ > PostgreSQL Development, 24x7 Support, Training & Services > > > -- > Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) > To make changes to your subscription: > http://www.postgresql.org/mailpref/pgsql-hackers
[HACKERS] avg,first,last,median in one query
Hi, hackers. I need advice from SQL experts: is there any way in PostgreSQL to calculate the avg, first, last and median aggregates in one query? Assume that we have the following table: create table Securities ("Symbol" varchar, "Date" date, "Time" time, "Price" real); We can simulate median using percentile_disc: select "Symbol","Date", avg("Price"), percentile_disc(0.5) within group (order by "Price") from Securities group by "Symbol","Date"; And all the other aggregates can be calculated using window functions: select distinct "Symbol","Date", first_value("Price") over (partition by "Symbol","Date" order by "Time" rows between unbounded preceding and unbounded following), last_value("Price") over (partition by "Symbol","Date" order by "Time" rows between unbounded preceding and unbounded following), avg("Price") over (partition by "Symbol","Date" rows between unbounded preceding and unbounded following) from Securities; I wonder whether there is any simpler or more efficient alternative to the query above. Unfortunately it is not possible to calculate the median in the same way, because percentile_disc is not compatible with OVER: ERROR: OVER is not supported for ordered-set aggregate percentile_disc So is there any chance to calculate all these four aggregates in one query without writing some supplementary functions? Additional question: what is the most efficient way of calculating MEDIAN in PostgreSQL? I found three different approaches: 1. Using a CTE: https://www.periscopedata.com/blog/medians-in-sql.html 2. Using a user-defined aggregate function which uses array_append and so materializes all values in memory: https://wiki.postgresql.org/wiki/Aggregate_Median 3. Using a percentile aggregate: http://blog.jooq.org/2015/01/06/how-to-emulate-the-median-aggregate-function-using-inverse-distribution-functions/ Thanks in advance, -- Konstantin Knizhnik Postgres Professional: http://www.postgrespro.com The Russian Postgres Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
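One possible single-pass formulation (an untested sketch, not from the original thread): first and last can be emulated with ordered array_agg, which lets everything stay in one GROUP BY query together with the ordered-set aggregate:

select "Symbol", "Date",
       avg("Price")                                         as avg_price,
       (array_agg("Price" order by "Time"))[1]              as first_price,
       (array_agg("Price" order by "Time" desc))[1]         as last_price,
       percentile_disc(0.5) within group (order by "Price") as median_price
from Securities
group by "Symbol", "Date";

The ordered array_agg calls materialize each group in memory, so this is simpler rather than necessarily faster than the window-function variant.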
[HACKERS] Very small patch for decode.c
diff --git a/src/backend/replication/logical/decode.c b/src/backend/replication/logical/decode.c
index 2380ea2..a992662 100644
--- a/src/backend/replication/logical/decode.c
+++ b/src/backend/replication/logical/decode.c
@@ -488,7 +488,7 @@ DecodeCommit(LogicalDecodingContext *ctx, XLogRecordBuffer *buf,
 {
 	XLogRecPtr	origin_lsn = InvalidXLogRecPtr;
 	TimestampTz	commit_time = parsed->xact_time;
-	XLogRecPtr	origin_id = XLogRecGetOrigin(buf->record);
+	RepOriginId	origin_id = XLogRecGetOrigin(buf->record);
 	int			i;

-- Konstantin Knizhnik Postgres Professional: http://www.postgrespro.com The Russian Postgres Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
[HACKERS] Limit and inherited tables
Hi, I am sorry if this question was already discussed, but I failed to find any information about it in the archive. I noticed that the LIMIT clause is not pushed down to inherited tables. Consider the following tables: create table foo(x integer primary key); create table foo1 () inherits(foo); create table foo2 () inherits(foo); insert into foo1 values (generate_series(0,10)); insert into foo2 values (generate_series(0,10)); explain select * from foo order by x limit 1; QUERY PLAN Limit (cost=5.10..5.10 rows=1 width=4) -> Sort (cost=5.10..5.61 rows=20 width=4) Sort Key: foo.x -> Append (cost=0.00..4.06 rows=20 width=4) -> Seq Scan on foo (cost=0.00..0.00 rows=1 width=4) -> Seq Scan on foo1 (cost=0.00..2.03 rows=10 width=4) -> Seq Scan on foo2 (cost=0.00..2.03 rows=10 width=4) (7 rows) So Postgres has to merge two large data sets and sort the result, while the optimal plan is to take just one record from each inherited table, sort these few records and then limit the result (a manual rewrite illustrating this is sketched below). Such an optimization would be especially useful when using postgres_fdw - when inherited tables are located at remote nodes. Are there any plans to support this optimization, or maybe somebody is already working on it? Otherwise I can try to investigate it and propose an optimizer patch for it. -- Konstantin Knizhnik Postgres Professional: http://www.postgrespro.com The Russian Postgres Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
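A hedged sketch of the manual rewrite that such an optimization would effectively perform: take LIMIT 1 from each branch of the Append (including the parent table itself) and then sort and limit the tiny union:

select * from (
    (select * from only foo order by x limit 1)
    union all
    (select * from foo1 order by x limit 1)
    union all
    (select * from foo2 order by x limit 1)
) u
order by x
limit 1;

With the tables above this still seq-scans each child (there are no child indexes), but it shows the shape of the plan the optimizer could produce; with per-child indexes or remote child tables each branch becomes cheap.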
Re: [HACKERS] Limit and inherited tables
This example is lacking indexes on the child tables, which is why the plan shown is about as good as you're going to get. The contents of foo1 and foo2 have to be read in entirety in any case, and sorting them separately is not a win compared to doing a single sort. It is true, but not in the case of an FDW connected to a remote host. In this case sending large volumes of data through the network will be very inefficient. There will be no problem if the FDW can provide an index scan - in this case MergeAppend will fetch only the required number of records: postgres=# explain analyze select * from t order by u limit 1; QUERY PLAN --- Limit (cost=300.17..300.23 rows=1 width=8) (actual time=4.588..4.588 rows=1 loops=1) -> Merge Append (cost=300.17..762.76 rows=7681 width=8) (actual time=4.586..4.586 rows=1 loops=1) Sort Key: t.u -> Index Scan using t_pkey on t (cost=0.12..8.14 rows=1 width=8) (actual time=0.003..0.003 rows=0 loops=1) -> Foreign Scan on t_fdw1 (cost=100.00..193.92 rows=2560 width=8) (actual time=1.532..1.532 rows=1 loops=1) -> Foreign Scan on t_fdw2 (cost=100.00..193.92 rows=2560 width=8) (actual time=1.510..1.510 rows=1 loops=1) -> Foreign Scan on t_fdw3 (cost=100.00..193.92 rows=2560 width=8) (actual time=1.535..1.535 rows=1 loops=1) But if the sort is performed on non-indexed fields, then the current behaviour is inefficient and could be significantly improved by pushing the limit down to the remote hosts. -- Konstantin Knizhnik Postgres Professional: http://www.postgrespro.com The Russian Postgres Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
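For readers who want to reproduce this, a hedged sketch of the kind of setup behind the plan above (server names, connection options and remote table names are illustrative, not taken from the actual test):

CREATE EXTENSION postgres_fdw;
CREATE SERVER shard1 FOREIGN DATA WRAPPER postgres_fdw OPTIONS (host 'node1', dbname 'postgres');
CREATE USER MAPPING FOR CURRENT_USER SERVER shard1;
CREATE TABLE t (u integer PRIMARY KEY, v integer);
CREATE FOREIGN TABLE t_fdw1 () INHERITS (t) SERVER shard1 OPTIONS (table_name 't_part1');
-- t_fdw2 and t_fdw3 are created the same way against the other shards

With ORDER BY on the primary key, the MergeAppend shown above stops after one row per branch; with ORDER BY on a non-indexed column each Foreign Scan would pull the whole remote table unless the limit is pushed down.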
Re: [HACKERS] Optimizer questions
I am sorry for badly formatted query - I just cut&paste it from C++ client program. I have one more question to community regarding this patch. Proposed patch fix the problem particularly for SORT+LIMIT clauses. In this case evaluation of expressions which are not used in sort is alway waste of time. But I wonder if we should delay evaluation of complex expressions even if there is no LIMIT? Quite often client application doesn't fetch all query results even if there is no LIMIT clause. On 18.01.2016 05:47, Bruce Momjian wrote: On Tue, Jan 5, 2016 at 05:55:28PM +0300, konstantin knizhnik wrote: Hi hackers, I want to ask two questions about PostgreSQL optimizer. I have the following query: SELECT o.id as id,s.id as sid,o.owner,o.creator,o.parent_id as dir_id,s.mime_id,m.c_type,s.p_file,s.mtime,o.ctime,o.name ,o.title,o.size,o.deleted,la.otime,la.etime,uo.login as owner_login,uc.login as creator_login,(CASE WHEN f.user_id IS NULL THEN 0 ELSE 1 END) AS flagged, (select 'userid\\:'||string_agg(user_id,' userid\\:') from get_authorized_users (o.id)) as acl FROM objects s JOIN objects o ON s.original_file=o.id LEFT JOIN flags f ON o.id=f.obj_id AND o.owner=f.user_id LEFT JOIN objects_last_activity la ON o.id = la.obj_id AND o.owner = la.user_id, mime m, users uc , users uo WHERE (s.mime_id=904 or s.mime_id=908) AND m.mime_id = o.mime_id AND o.owner = uo.user_id AND o.creator = uc.user_id ORDER BY s.mtime LIMIT 9; FYI, I could not make any sense out of this query, and I frankly can't figure out how others can udnerstand it. :-O Anyway, I ran it through pgFormatter (https://github.com/darold/pgFormatter), which I am showing here because I was impressed with the results: SELECT o.id AS id, s.id AS sid, o.owner, o.creator, o.parent_id AS dir_id, s.mime_id, m.c_type, s.p_file, s.mtime, o.ctime, o.name, o.title, o.size, o.deleted, la.otime, la.etime, uo.login AS owner_login, uc.login AS creator_login, ( CASE WHEN f.user_id IS NULL THEN 0 ELSE 1 END ) AS flagged, ( SELECT 'userid\\:' || string_agg ( user_id, ' userid\\:' ) FROM get_authorized_users ( o.id ) ) AS acl FROM objects s JOIN objects o ON s.original_file = o.id LEFT JOIN flags f ON o.id = f.obj_id AND o.owner = f.user_id LEFT JOIN objects_last_activity la ON o.id = la.obj_id AND o.owner = la.user_id, mime m, users uc, users uo WHERE ( s.mime_id = 904 OR s.mime_id = 908 ) AND m.mime_id = o.mime_id AND o.owner = uo.user_id AND o.creator = uc.user_id ORDER BY s.mtime LIMIT 9; -- Konstantin Knizhnik Postgres Professional: http://www.postgrespro.com The Russian Postgres Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
[HACKERS] Batch update of indexes
Hi hackers, I want to know the opinion of the community about possible ways of solving a quite common problem: increasing insert speed while still providing indexes for efficient execution of queries. Many applications have to deal with a high input stream of data. Most of the time spent inserting a record into the database is taken by updating indexes. And without indexes we are not able to execute queries efficiently. So in many cases it is desirable to have "batch or concurrent" index updates. And it is acceptable that an index is slightly behind the current state of the table. One interesting approach to solving this problem is discussed in this article: https://mark.zealey.org/2016/01/08/how-we-tweaked-postgres-upsert-performance-to-be-2-3-faster-than-mongodb They are using materialized views to build indexes in the background. Interesting idea, but copying the content of the whole table just to be able to build an index concurrently seems to be overkill. I thought about more straightforward ways of solving this problem. It would be nice if we could preserve one of the main postulates of Postgres and other RDBMSes: indexes are just an optimization and the result of a query should not depend on the presence of indexes. The first idea is to use inheritance. I have investigated different ways of splitting a table into "archival" and "operational" parts, but all of them require physically copying data from one table to another. Another idea is to use partial indexes (http://www.postgresql.org/docs/current/static/indexes-partial.html) Assume that we have a stream of input data where each record has an increasing timestamp: create table t( ts timestamp primary key, c1 real, c2 integer, c3 varchar, ... cN char(5) ); We want to provide the highest insert speed for "t" but provide indexes for the c1..cN fields. We can declare partial indexes: create index idx1 on t(c1) where ts < '20/01/2016'; create index idx2 on t(c2) where ts < '20/01/2016'; ... create index idxN on t(cN) where ts < '20/01/2016'; Since these indexes do not cover the current date, they will not be affected by insert operations. But we can still efficiently run queries like select * from t where c1>100 and ts < '20/01/2016'; Then, in the background, maybe at night, we can do alter index idx1 where ts < '21/01/2016'; Please notice that such an alter index statement, changing the condition of a partial index, is not supported now. But I do not see any fundamental problems with supporting such a construction. We should just include in the index all records which match the new condition and do not match the old condition: ts < '21/01/2016' and not (ts < '20/01/2016') If there is an index on the "ts" field this can be done quite efficiently (see the sketch after this message). This approach doesn't contradict the concept of indexes in an RDBMS. But there is one more problem with this approach which I think should be addressed. Right now the optimizer builds the following execution plan for a query with a partial index: postgres=# explain select * from t where c1 < 10 and ts < '20/01/2016'::timestamp; QUERY PLAN --- Bitmap Heap Scan on t (cost=7.20..732.14 rows=12263 width=12) Recheck Cond: ((c1 < '10'::double precision) AND (ts < '2016-01-20 00:00:00'::timestamp without time zone)) -> Bitmap Index Scan on idx1 (cost=0.00..4.13 rows=12263 width=0) Index Cond: (c1 < '10'::double precision) (4 rows) As you can see, the optimizer inserts a recheck into the query execution plan while it is not needed in this case: the search condition is exactly the same as the partial index condition. 
The optimal plan should be: Index Scan using idx1 on t (cost=0.00..4.13 rows=12263 width=0) Index Cond: (c1 < '10'::double precision) What do you think about this approach? Will it be useful to work in this direction? Or are there better solutions for this problem? -- Konstantin Knizhnik Postgres Professional: http://www.postgrespro.com The Russian Postgres Company
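As a hedged illustration of the backfill step mentioned above (not code from any patch): the set of rows the proposed ALTER INDEX would have to add when the predicate is extended from ts < '20/01/2016' to ts < '21/01/2016' can be expressed as a plain range query, which is cheap if ts is indexed:

select ctid, c1
from t
where ts < '21/01/2016' and not (ts < '20/01/2016');

Each returned heap tuple would then be inserted into idx1, after which the stored index predicate is switched to the new condition.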
Re: [HACKERS] Batch update of indexes
Hi, Hi, I am glad to see that you are interested in that too. I think this is a good feature and I think it will be very useful to have. I have already mentioned some related problems and possible improvements in my presentation. http://www.slideshare.net/AnastasiaLubennikova/indexes-dont-mean-slow-inserts The last two slides concern this thread. Briefly, I've suggested thinking about an insertion buffer. Something very similar to it is already implemented in BRIN. It does not index the last data from the heap while the number of trailing pages is less than pages_per_range. Do you mean GIN-like usage of an insertion buffer (there it is called a "pending list")? So that we have to combine a search in the main tree with a search in the insert buffer? Actually this is what I want to avoid (because at least in the case of GIN the pending list causes significant performance degradation, while an up-to-date state of a full-text index is rarely required). A short example of GIN's pending-list option follows at the end of this message. The next point I've thought about is bulk update. The problem is that an update like "UPDATE mytable set a = a+1;" causes N searches from the root of the B-tree. It looks very strange to me, and I'd like to fix it somehow. The obvious solution is to update all tuples on the page at a time, and keep the number of the last updated page. But maybe it's a bit off-thread here. Bulk update is the second question (but a very important one). First I just want to be able to append to the index concurrently, without blocking inserts. One interesting approach to solving this problem is discussed in this article: https://mark.zealey.org/2016/01/08/how-we-tweaked-postgres-upsert-performance-to-be-2-3-faster-than-mongodb They are using materialized views to build indexes in the background. Interesting idea, but copying the content of the whole table just to be able to build an index concurrently seems to be overkill. This approach seems like a tricky crutch to me. And I agree that it requires a lot of extra work. It will be very interesting to know how people are using materialized views. Delayed building of indexes seems to be one of the popular use cases, although it requires large overhead, first of all storage overhead. Please notice that such an alter index statement, changing the condition of a partial index, is not supported now. Don't you think that this feature could be used in a very wrong way? Do not take it as criticism, just a bit of thought. Everything which can be misused will be misused:) But I do not worry much about it... If it can address real challenges, then it will be a good thing in any case. Ideally we should be able to alter everything. A naive implementation of such an alter clause could just build a new index with a temporary name, then drop the old index and rename the new one. There was a discussion of the patch for partial indexes. http://postgresql.nabble.com/PATCH-index-only-scans-with-partial-indexes-td5857568.html Since I haven't watched it closely, it seems to be open still. I think it'll be interesting to you. Such a small patch... Why was it not accepted? I do not see any problems with it... -- Anastasia Lubennikova Postgres Professional: http://www.postgrespro.com The Russian Postgres Company -- Konstantin Knizhnik Postgres Professional: http://www.postgrespro.com The Russian Postgres Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
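For reference, a hedged example of the existing deferred-insertion behaviour mentioned above: GIN's fastupdate storage parameter sends new entries to a pending list that is merged into the main tree later (by VACUUM or when the list overflows), which is exactly the trade-off being discussed. Table and column names are illustrative:

CREATE TABLE docs (id serial PRIMARY KEY, body text);
CREATE INDEX docs_fts_idx ON docs
    USING gin (to_tsvector('english', body))
    WITH (fastupdate = on);

Scans then have to read both the main tree and the pending list, which is where the performance degradation mentioned above comes from.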
Re: [HACKERS] Proposal for UPDATE: do not insert new tuple on heap if update does not change data
Hi, To eliminate the creation of a new tuple version in this case it is necessary to check that the update actually doesn't change the record. It is not the cheapest test, and it seems not to be a good idea to perform it always. But if you feel that in your case there are many "identical" updates, you can always explicitly rewrite the query by adding an extra check: UPDATE foo SET val = 'second' where pk = 2 and val <> 'second'; On 20.01.2016 12:55, Gasper Zejn wrote: Hi, I was wondering if PostgreSQL adds a new tuple if data is not changed when using UPDATE. It turns out it does add them and I think it might be beneficial not to add a new tuple in this case, since it causes a great deal of maintenance: updating indexes, vacuuming table and index, also heap fragmentation. How to check: CREATE TABLE foo (pk serial primary key, val text); -- Starting point: two rows. INSERT INTO foo VALUES (1, 'first'); INSERT INTO foo VALUES (2, 'second'); CHECKPOINT; -- Updating row with same value. UPDATE foo SET val = 'second' where pk = 2; CHECKPOINT; -- "Upsert" is the same. INSERT INTO foo VALUES (2, 'second') ON CONFLICT (pk) DO UPDATE SET val = 'second'; CHECKPOINT; If after any checkpoint you look at page data, you can see multiple versions of the same row with "second". Unfortunately, I don't believe I can come up with a patch on my own, but will happily offer any further help with testing and ideas. Attached is a script with a minimal test case. Kind regards, Gasper Zejn -- Konstantin Knizhnik Postgres Professional: http://www.postgrespro.com The Russian Postgres Company
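It may also be worth noting (as a hedged pointer, not something stated in the original thread) that PostgreSQL already ships a generic trigger for exactly this situation: suppress_redundant_updates_trigger() skips an UPDATE when the new row is byte-for-byte identical to the old one, at the cost of performing that comparison on every update of the table:

CREATE TRIGGER foo_suppress_redundant
    BEFORE UPDATE ON foo
    FOR EACH ROW
    EXECUTE PROCEDURE suppress_redundant_updates_trigger();

This gives the same effect as the explicit val <> 'second' check above, but without rewriting each query.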
Re: [HACKERS] Batch update of indexes
On Jan 21, 2016, at 5:14 AM, Simon Riggs wrote: > On 20 January 2016 at 14:55, Konstantin Knizhnik > wrote: > Hi, > > Hi, I am glad to see that you are interested in that too. > I think this is a good feature and I think it will be very useful to have. > I have already mentioned some related problems and possible improvements in > my presentation. > http://www.slideshare.net/AnastasiaLubennikova/indexes-dont-mean-slow-inserts > The last two slides concern this thread. Briefly, I've suggested thinking > about an insertion buffer. Something very similar to it is already implemented > in BRIN. It does not index the last data from the heap while the number of trailing > pages is less than pages_per_range. > > Do you mean GIN-like usage of an insertion buffer (there it is called a "pending > list")? > So that we have to combine a search in the main tree with a search in the insert buffer? > Actually this is what I want to avoid (because at least in the case of GIN > the pending list causes significant performance degradation, > while an up-to-date state of a full-text index is rarely required). > > The degradation in performance is because the scan of the pending list is O(N). > > If we did the same thing for monotonic inserts into a btree, the performance > of ruling out any contents in the pending list would be O(1), so it is more > feasible than you say. Sorry, I didn't get it. Certainly for B-Tree we can organize the insert buffer (or pending list) as a sorted array or as a tree. But in both cases the complexity of search in this buffer will be O(log(N)), not O(1). O(1) is possible only if we use a hash table for the pending list and are lucky with the hash function. But in this case it will not be possible to support range queries and proper ordering. In any case, the necessity to perform two searches instead of one and combine the results will significantly complicate the index implementation. But definitely this solution is more convenient and transparent for users... > > -- > Simon Riggs http://www.2ndQuadrant.com/ > PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Re: [HACKERS] Batch update of indexes
On 21.01.2016 10:14, Simon Riggs wrote: On 21 January 2016 at 06:41, konstantin knizhnik wrote: Certainly for B-Tree we can organize the insert buffer (or pending list) as a sorted array or as a tree. But in both cases the complexity of search in this buffer will be O(log(N)), not O(1). You are right; search within this buffer is O(log(N)). But we can test whether we need to search in the buffer at all with O(1). Only if records are inserted in key-ascending order... But usually we need to maintain more than one index, and for them it is not true. -- Simon Riggs http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services -- Konstantin Knizhnik Postgres Professional: http://www.postgrespro.com The Russian Postgres Company
Re: [HACKERS] Batch update of indexes
On 21.01.2016 19:09, Anastasia Lubennikova wrote: What I meant is more like a BRIN-like combination of an index scan and heap scan. Maybe it could be called "deferred inserts" or "temporary read-only index". Maybe it's similar to the MySQL insert buffer http://dev.mysql.com/doc/refman/5.7/en/innodb-insert-buffering.html I think it'll be more clear with an example. Please don't care about syntax. CREATE TABLE tbl (c1 int); CREATE INDEX idx on tbl(c1); SET enable_deferred_insert(idx) = on; At this moment, we save the last_indexed_item (its TID) somewhere in the index metapage. Since that moment, the data inserted into the table doesn't touch the index. We perform some heavy insert and then go back to the normal index behavior. SET enable_deferred_insert(idx) = off; This command takes all the data between the last_indexed_item and the end of the table, and inserts it into the index at a time. Of course there are new problems to deal with, but it's really useful for the use case of balancing an irregular heavy write load, isn't it? BTW, could you explain what the reason is to copy data into the pending list and then copy it again while flushing the pending list into the index? Why not read this data directly from the table? I feel that I've missed something important here. No, I do not think that inserted data should be placed in a pending list and then copied to the main table. It should be stored directly in the main table, and the "pending list" is just some fast, transient index. From my point of view there are two possibilities: 1. Preserve strict table-index consistency: query results should not depend on the presence of the index. 2. Support out-of-date or deferred indexes, which can be updated in the background. The second approach is certainly more efficient and IMHO acceptable for most users. But we would be violating one of the fundamental properties of RDBMSes... So I am not sure which approach to choose. The first case is also harder to implement, because we have to somehow merge two indexes during an index scan and provide proper recovery of the main index in case of failure (assuming that the pending list is maintained in memory and is lost after a crash). -- Konstantin Knizhnik Postgres Professional: http://www.postgrespro.com The Russian Postgres Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Batch update of indexes
Hi hackers, I have implemented "ALTER INDEX ... WHERE ..." clause allowing to change condition for partial index. Actually it allows us to append index without fully rebuilding it. As I explained in the previous mails, partial indexes can be used to increase insert speed. Right now I get the following results (with one insert stream): Insert with 1 index (primary key, monotonically ascending): 324275 TPS Insert with 9 indexes (primary key + 8 indexes with random keys): 52495 TPS Insert with primary key and 8 concurrently updated partial indexes: 194458 TPS Insert with primary key and 8 "frozen" partial indexes: 278446 TPS So, as you can see insert with indexes is about 6 times slower than insert without indexes. And partial indexes allows to eliminate this gap. When partial indexes are not affected (assuming that them will be reconstructed "at night"), performance is almost the same, as without indexes. And if "ALTER INDEX" is done concurrently with inserts, it certainly decrease insert speed, but still it is 4 times faster than with normal indexes. Such high TPS values were obtained using "insert from select" to bypass libpq overhead. With libpq (when each insert is sent as independent statement) results are less impressive: Insert with 1 index (primary key, monotonically ascending): 37892 TPS Insert with 9 indexes (primary key + 8 indexes with random keys): 20231 TPS Insert with primary key and 8 concurrently updated partial indexes: 26934 TPS Insert with primary key and 8 "frozen" partial indexes: 28863 TPS But still partial indexes allows to almost eliminate two times differences... This results can be reproduced using our public repository: https://github.com/postgrespro/postgres_cluster Most of the code related with support of "ALTER INDEX .. WHERE" is in AlterIndex function in postgres_cluster/src/backend/commands/indexcmds.c I have also added insbench utility for measuring insert performance, using which this results were obtained. It is located in postgres_cluster/src/bin/insbench directory. Known issues: 1. I do not handle case when new condition for partial index is more restricted than original. There is no way in Postgres to exclude records from index (except VACUUM), so in this case index has to be reconstructed from scratch. 2. Currently I am using SPI to locate records which should be included in index. 3. I am not completely sure that there are no synchronization/isolation problems in AlterIndex function If this approach is considered to be interesting by community, I will try to address these issues. On 20.01.2016 12:28, Konstantin Knizhnik wrote: Hi hackers, I want to know opinion of community about possible ways of solving quite common problem: increasing insert speed while still providing indexes for efficient execution of queries. Many applications have to deal with high input stream of data. Most of the time while record inserting in the database is taken for update of indexes. And without indexes we are not able to efficiently execute queries. So in many cases it is desirable to have "batch or concurrent" index update. And it is acceptable that an index is slightly behind current state of the table. One interesting approach of solving this problem is discussed in this article: https://mark.zealey.org/2016/01/08/how-we-tweaked-postgres-upsert-performance-to-be-2-3-faster-than-mongodb Them are using materialized views to build indexes in background. 
Interesting idea, but copying content of the whole table just to be able to build index concurrently seems to be overkill. I thought about more straightforward ways of solving this problem. It will be nice if we can preserve of of them main postulates of Postgres and other RDBMSes: indexes are just optimization and result of query should not depend on presence of indexes. First idea is to use inheritance. I have investigated different ways of splitting table into "archival" and "operational" parts, but all of them requiring physical copying of data from one table to another. Another idea is to use partial indexes (http://www.postgresql.org/docs/current/static/indexes-partial.html) Assume that we have stream of input data where each record have increased timestamp: create table t( ts timestamp primary key, c1 real, c2 integer, c3 varchar, ... cN char(5) ); We want to provide the highest insert speed for "t" but provide indexes for c1..cN fields. We can declared partial indexes: create index idx1 on t(c1) where ts < '20/01/2016'; create index idx2 on t(c2) where ts < '20/01/2016'; ... create index idxN on t(cN) where ts < '20/01/2016'; As far as this indexes do not cover current d
Re: [HACKERS] Optimizer questions
Unfortunately these two statements are not equivalent: the second one can (in theory, although not for this particular data set) return more than 10 result records. Such an optimization would be correct if t2.i were declared unique. But the most efficient plan for this query is generated if there is an index on t1.v. In this case no explicit sort is needed. The limit is still not pushed down, but it is not a problem because a nested loop join is used, which does not materialize its result (it produces records on demand): # explain analyze select * from t1 left outer join t2 on t1.k=t2.k order by t1.v limit 10; QUERY PLAN -- Limit (cost=0.58..4.38 rows=10 width=16) (actual time=0.069..0.157 rows=10 loops=1) -> Nested Loop Left Join (cost=0.58..37926.63 rows=11 width=16) (actual time=0.067..0.154 rows=10 loops=1) -> Index Scan using idxv on t1 (cost=0.29..3050.31 rows=11 width=8) (actual time=0.046..0.053 rows=10 loops=1) -> Index Scan using t2_pkey on t2 (cost=0.29..0.34 rows=1 width=8) (actual time=0.007..0.007 rows=1 loops=10) Index Cond: (t1.k = k) Planning time: 0.537 ms Execution time: 0.241 ms (7 rows) On 01/30/2016 01:01 AM, Alexander Korotkov wrote: On Fri, Jan 8, 2016 at 11:58 AM, Konstantin Knizhnik wrote: Attached please find an improved version of the optimizer patch for the LIMIT clause. Now I try to apply this optimization only for non-trivial columns requiring evaluation. Maybe it would be better to take into account the cost of evaluating these columns, but right now I just detect non-variable columns. We may think about a more general feature: LIMIT pushdown. In Konstantin's patch the planner pushes LIMIT before tlist calculation. But there are other cases when calculating LIMIT first would be beneficial. For instance, we can do LIMIT before JOINs. That is possible only for LEFT JOIN which is not used in filter and ordering clauses. See the example below. 
# create table t1 as (select i, random() v from generate_series(1,100) i); SELECT 100 # create table t2 as (select i, random() v from generate_series(1,100) i); SELECT 100 # explain analyze select * from t1 left join t2 on t1.i = t2.i order by t1.v limit 10; QUERY PLAN Limit (cost=87421.64..87421.67 rows=10 width=24) (actual time=1486.276..1486.278 rows=10 loops=1) -> Sort (cost=87421.64..89921.64 rows=100 width=24) (actual time=1486.275..1486.275 rows=10 loops=1) Sort Key: t1.v Sort Method: top-N heapsort Memory: 25kB -> Hash Left Join (cost=27906.00..65812.00 rows=100 width=24) (actual time=226.180..1366.238 rows=100 Hash Cond: (t1.i = t2.i) -> Seq Scan on t1 (cost=0.00..15406.00 rows=100 width=12) (actual time=0.010..77.041 rows=100 l -> Hash (cost=15406.00..15406.00 rows=100 width=12) (actual time=226.066..226.066 rows=100 loop Buckets: 131072 Batches: 1 Memory Usage: 46875kB -> Seq Scan on t2 (cost=0.00..15406.00 rows=100 width=12) (actual time=0.007..89.002 rows=100 Planning time: 0.123 ms Execution time: 1492.118 ms (12 rows) # explain analyze select * from (select * from t1 order by t1.v limit 10) t1 left join t2 on t1.i = t2.i; QUERY PLAN Hash Right Join (cost=37015.89..56171.99 rows=10 width=24) (actual time=198.478..301.278 rows=10 loops=1) Hash Cond: (t2.i = t1.i) -> Seq Scan on t2 (cost=0.00..15406.00 rows=100 width=12) (actual time=0.005..74.303 rows=100 loops=1) -> Hash (cost=37015.77..37015.77 rows=10 width=12) (actual time=153.584..153.584 rows=10 loops=1) Buckets: 1024 Batches: 1 Memory Usage: 1kB -> Limit (cost=37015.64..37015.67 rows=10 width=12) (actual time=153.579..153.580 rows=10 loops=1) -> Sort (cost=37015.64..39515.64 rows=100 width=12) (actual time=153.578..153.578 rows=10 loops=1) Sort Key: t1.v Sort Method: top-N heapsort Memory: 25kB -> Seq Scan on t1 (cost=0.00..15406.00 rows=100 width=12) (actual time=0.012..78.828 rows=100 Planning time: 0.132 ms Execution time: 301.308 ms (12 rows) In this example LIMIT pushdown makes query 5 times faster. It would be very nice if optimizer make this automatically. -- Alexander Korotkov Postgres Professional: http://www.postgrespro.com <http://www.postgrespro.com/> The Russian Postgres Company -- Konstantin Knizhnik Postgres Profe
Re: [HACKERS] PATCH: index-only scans with partial indexes
I am very interested in this patch because it allows partial indexes to be used to ... speed up inserts. I have implemented an "ALTER INDEX ... WHERE ..." construction which allows the predicate of a partial index to be changed without fully rebuilding it. So it is not necessary to insert new records into the index immediately (if the new records do not match the partial index condition). It can be done later in the background (or at night). My experiments show that it allows insert speed to be increased five times (for eight partial indexes). At the same time we do not lose the RDBMS property that the result of a query should not depend on the presence of indexes. And it is applicable to all indexes: B-Tree, GIN, GIST,... But such an optimization makes sense only if partial indexes can be used without extra overhead, first of all for index-only scans. And that is impossible without this patch. On 01/31/2016 03:34 PM, Alvaro Herrera wrote: Tomas Vondra wrote: On 12/24/2015 04:05 AM, Michael Paquier wrote: Tomas, are you still working on that? This thread is stalling for 3 weeks. I haven't discovered anything interesting during the testing, so I guess the "needs review" state is appropriate. Let's move the patch to the next commitfest. Not sure what to do here, since this patch got no feedback at all in this CF. The right thing to do, ISTM, is to just move it again to the next CF. But it'd be really useful if someone can have a look at it and verify at least whether it doesn't need a rebase (requiring a further submission) so that other people can play with it. Of course, if Horiguchi-san or anyone has more review comments, that would be even better. Tomas said he'd do more testing, but we never got a report on whether anything turned up. (At this point I'm not sure if either Kyotaro or Tomas should be considered the patch author ... maybe both?) -- Konstantin Knizhnik Postgres Professional: http://www.postgrespro.com The Russian Postgres Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] PATCH: index-only scans with partial indexes
I have applied this patch to our working branch, and during several weeks we ran various tests and benchmarks. We have not noticed any problems or performance degradation. And for some queries this patch causes a very significant increase in performance - ten times: With this patch: postgres=# explain analyze select count(*) from t where k1<100 and pk < 1454434742881892; QUERY PLAN Aggregate (cost=29.65..29.66 rows=1 width=0) (actual time=0.108..0.108 rows=1 loops=1) -> Index Only Scan using idx1 on t (cost=0.43..27.49 rows=861 width=0) (actual time=0.012..0.071 rows=963 loops=1) Index Cond: (k1 < 100) Heap Fetches: 0 Planning time: 0.100 ms Execution time: 0.121 ms (6 rows) Without patch: postgres=# explain analyze select count(*) from t where k1<100 and pk < 1454434742881892; QUERY PLAN Aggregate (cost=2951.55..2951.56 rows=1 width=0) (actual time=1.070..1.070 rows=1 loops=1) -> Bitmap Heap Scan on t (cost=19.10..2949.40 rows=861 width=0) (actual time=0.161..0.997 rows=963 loops=1) Recheck Cond: ((k1 < 100) AND (pk < '1454434742881892'::bigint)) Heap Blocks: exact=954 -> Bitmap Index Scan on idx1 (cost=0.00..18.88 rows=861 width=0) (actual time=0.083..0.083 rows=963 loops=1) Index Cond: (k1 < 100) Planning time: 0.099 ms Execution time: 1.089 ms (8 rows) On 01.02.2016 01:11, Alvaro Herrera wrote: Konstantin Knizhnik wrote: I am very interested in this patch because it allows partial indexes to be used to ... speed up inserts. I have implemented an "ALTER INDEX ... WHERE ..." construction which allows the predicate of a partial index to be changed without fully rebuilding it. So it is not necessary to insert new records into the index immediately (if the new records do not match the partial index condition). It can be done later in the background (or at night). My experiments show that it allows insert speed to be increased five times (for eight partial indexes). At the same time we do not lose the RDBMS property that the result of a query should not depend on the presence of indexes. And it is applicable to all indexes: B-Tree, GIN, GIST,... But such an optimization makes sense only if partial indexes can be used without extra overhead, first of all for index-only scans. And that is impossible without this patch. That sounds interesting. So please review this patch and let us know whether you like it, or whether you have any better ideas for any particular hunk, or whether you think it should be rewritten from scratch, or just state that it is perfect. If you think it's useful, then it's a good idea to vouch for it to be integrated; and one way of doing that is making sure it meets the quality standards etc. If you don't say anything about the patch, we may never integrate it because we might have doubts about whether it's worthy. -- Konstantin Knizhnik Postgres Professional: http://www.postgrespro.com The Russian Postgres Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
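As a side note, here is a hedged reconstruction of the test schema, guessed from the plan output above rather than taken from the actual test script: idx1 appears to be a partial index whose predicate implies the pk condition of the query, which is exactly the case where this patch allows an index-only scan without a recheck:

create table t (pk bigint primary key, k1 integer);
create index idx1 on t (k1) where pk < 1454434742881892;

With that shape of index, the pk < 1454434742881892 clause is proven by the index predicate, so only k1 < 100 remains as the index condition in the plan above.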
Re: [HACKERS] Batch update of indexes
Attached please find patch for "ALTER INDEX ... WHERE ..." clause. It is now able to handle all three possible situations: 1. Making index partial (add WHERE condition to the ordinary index) 2. Extend partial index range (less restricted index predicate) 3. Arbitrary change of partial index predicate In case 2) new records are added to the index. In other two cases index is completely reconstructed. This patch includes src/bin/insbench utility for testing insert performance. It can be easily excluded from the patch to reduce it size. Also it is better to apply this patch together with "index-only scans with partial indexes" patch: http://www.postgresql.org/message-id/560c7213.3010...@2ndquadrant.com only in this case regression test will produce expected output. On 27.01.2016 23:15, Robert Haas wrote: On Wed, Jan 20, 2016 at 4:28 AM, Konstantin Knizhnik wrote: Please notice that such alter table statement, changing condition for partial index, is not supported now. But I do not see any principle problems with supporting such construction. We should just include in the index all records which match new condition and do not match old condition: ts < '21/01/2016' and not (ts < '20/01/2016') You'd also need to remove any rows from the index that match the old condition but not the new one. In your example, that's impossible, but in general, it's definitely possible. -- Konstantin Knizhnik Postgres Professional: http://www.postgrespro.com The Russian Postgres Company diff --git a/src/backend/commands/indexcmds.c b/src/backend/commands/indexcmds.c index b450bcf..b6ffb19 100644 --- a/src/backend/commands/indexcmds.c +++ b/src/backend/commands/indexcmds.c @@ -32,6 +32,7 @@ #include "commands/tablespace.h" #include "mb/pg_wchar.h" #include "miscadmin.h" +#include "funcapi.h" #include "nodes/nodeFuncs.h" #include "optimizer/clauses.h" #include "optimizer/planner.h" @@ -50,6 +51,9 @@ #include "utils/snapmgr.h" #include "utils/syscache.h" #include "utils/tqual.h" +#include "utils/ruleutils.h" +#include "executor/executor.h" +#include "executor/spi.h" /* non-export function prototypes */ @@ -275,6 +279,160 @@ CheckIndexCompatible(Oid oldId, return ret; } +static void +UpdateIndex(Oid indexRelationId, Node* whereClause) +{ + Datum values[Natts_pg_index]; + bool isnull[Natts_pg_index]; + HeapTuple oldTuple; + HeapTuple newTuple; + Relation pg_index; + + pg_index = heap_open(IndexRelationId, RowExclusiveLock); + oldTuple = SearchSysCacheCopy1(INDEXRELID, ObjectIdGetDatum(indexRelationId)); + if (!HeapTupleIsValid(oldTuple)) + elog(ERROR, "cache lookup failed for index %u", indexRelationId); + + heap_deform_tuple(oldTuple, RelationGetDescr(pg_index), values, isnull); + values[Anum_pg_index_indpred - 1] = CStringGetTextDatum(nodeToString(whereClause)); + isnull[Anum_pg_index_indpred - 1] = false; + newTuple = heap_form_tuple(RelationGetDescr(pg_index), values, isnull); + simple_heap_update(pg_index, &oldTuple->t_self, newTuple); + CatalogUpdateIndexes(pg_index, newTuple); + heap_freetuple(newTuple); + heap_freetuple(oldTuple); + heap_close(pg_index, NoLock); +} + +void +AlterIndex(Oid indexRelationId, IndexStmt *stmt) +{ + char* select; + Oid heapRelationId; + IndexUniqueCheck checkUnique; + Datum values[INDEX_MAX_KEYS]; + bool isnull[INDEX_MAX_KEYS]; + Relation heapRelation; + Relation indexRelation; +SPIPlanPtr plan; +Portal portal; + HeapTuple tuple; + TupleTableSlot *slot; + ItemPointer tupleid; + IndexInfo *indexInfo; + EState *estate; + Oid namespaceId; + List* deparseCtx; + char* oldIndexPredicate; + 
char* newIndexPredicate; + char* relationName; + + Assert(stmt->whereClause); + CheckPredicate((Expr *) stmt->whereClause); + + /* Open and lock the parent heap relation */ + heapRelationId = IndexGetRelation(indexRelationId, false); + heapRelation = heap_open(heapRelationId, AccessShareLock); + + /* Open the target index relation */ + /* indexRelation = index_open(indexRelationId, RowExclusiveLock); */ + indexRelation = index_open(indexRelationId, ShareUpdateExclusiveLock); + /* indexRelation = index_open(indexRelationId, AccessShareLock); */ + namespaceId = RelationGetNamespace(indexRelation); + + indexInfo = BuildIndexInfo(indexRelation); + Assert(!indexInfo->ii_ExclusionOps); + + /* + * Generate the constraint and default execution states + */ + estate = CreateExecutorState(); + + checkUnique = indexRelation->rd_index->indisunique ? UNIQUE_CHECK_YES : UNIQUE_CHECK_NO; + + slot = MakeSingleTupleTableSlot(RelationGetDescr(heapRelation)); + + deparseCtx = deparse_context_for(RelationGetRelationName(heapRelation), heapRelationId); + relationName = quote_qualified_identifier(get_namespace_name(namespaceId), +
Re: [HACKERS] Batch update of indexes
On Feb 4, 2016, at 2:00 AM, Jim Nasby wrote: > > My suspicion is that it would be useful to pre-order the new data before > trying to apply it to the indexes. Sorry, but ALTER INDEX is expected to work for all indexes, not only B-Tree, and for some of them sorting may not be possible... But for B-Tree, presorting the inserted data should certainly increase performance. I will think about it. > -- > Jim Nasby, Data Architect, Blue Treble Consulting, Austin TX > Experts in Analytics, Data Architecture and PostgreSQL > Data in Trouble? Get it in Treble! http://BlueTreble.com > > > -- > Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) > To make changes to your subscription: > http://www.postgresql.org/mailpref/pgsql-hackers -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
[HACKERS] eXtensible Transaction Manager API (v2)
Hi, The Postgres Professional cluster team wants to propose a new version of the eXtensible Transaction Manager API. The previous discussion concerning this patch can be found here: http://www.postgresql.org/message-id/f2766b97-555d-424f-b29f-e0ca0f6d1...@postgrespro.ru The API patch itself is small enough, but we think that it would be strange to provide just the API without examples of its usage. We have implemented various distributed transaction managers based on this API: pg_dtm (based on snapshot sharing) and pg_tsdtm (CSN based on local system time). Based on these two DTM implementations we have developed various "cluster" implementations: multimaster+pg_dtm, multimaster+pg_tsdtm, pg_shard+pg_dtm, pg_shard+pg_tsdtm, postgres_fdw+pg_dtm, postgres_fdw+pg_tsdtm,... Multimaster is based on logical replication and is something like BDR, but synchronous: it provides consistency across the cluster. But we want to make this patch as small as possible, so we decided to include in it only pg_tsdtm and a patch of postgres_fdw allowing it to be used with pg_tsdtm. pg_tsdtm is simpler than pg_dtm because the latter includes an arbiter with the RAFT protocol (a centralized service) and sockhub for efficient multiplexing of backend connections. Also, in theory, pg_tsdtm provides better scalability, because it is decentralized. The architecture of DTM and tsDTM as well as benchmark results are available at the WiKi page: https://wiki.postgresql.org/wiki/DTM Please notice that pg_tsdtm is just a reference implementation of a DTM using this XTM API. The primary idea of this patch is to add the XTM API to the PostgreSQL code, making it possible to implement custom transaction managers as Postgres extensions. So please review first of all the XTM API itself and not pg_tsdtm, which is just an example of its usage. The complete PostgreSQL branch with all our changes can be found here: https://github.com/postgrespro/postgres_cluster -- Konstantin Knizhnik Postgres Professional: http://www.postgrespro.com The Russian Postgres Company diff --git a/contrib/pg_tsdtm/Makefile b/contrib/pg_tsdtm/Makefile new file mode 100644 index 000..e70dffc --- /dev/null +++ b/contrib/pg_tsdtm/Makefile @@ -0,0 +1,20 @@ +MODULE_big = pg_tsdtm +OBJS = pg_tsdtm.o + +EXTENSION = pg_tsdtm +DATA = pg_tsdtm--1.0.sql + +ifdef USE_PGXS +PG_CONFIG = pg_config +PGXS := $(shell $(PG_CONFIG) --pgxs) +include $(PGXS) +else +subdir = contrib/pg_tsdtm +top_builddir = ../.. 
+include $(top_builddir)/src/Makefile.global +include $(top_srcdir)/contrib/contrib-global.mk +endif + +check: + env DESTDIR='$(abs_top_builddir)'/tmp_install make install + $(prove_check) diff --git a/contrib/pg_tsdtm/dtm_recovery/dtm_recovery.cpp b/contrib/pg_tsdtm/dtm_recovery/dtm_recovery.cpp new file mode 100644 index 000..38285be --- /dev/null +++ b/contrib/pg_tsdtm/dtm_recovery/dtm_recovery.cpp @@ -0,0 +1,129 @@ +#include +#include +#include +#include + +#include +#include +#include + +using namespace std; +using namespace pqxx; + +int main (int argc, char* argv[]) +{ +if (argc == 1){ +printf("Use -h to show usage options\n"); +return 1; +} +vector connections; +set prepared_xacts; +set committed_xacts; +bool verbose = false; +for (int i = 1; i < argc; i++) { +if (argv[i][0] == '-') { +switch (argv[i][1]) { + case 'C': + case 'c': +connections.push_back(string(argv[++i])); +continue; + case 'v': +verbose = true; +continue; +} +} +printf("Perform recovery of pg_tsdtm cluster.\n" + "Usage: dtm_recovery {options}\n" + "Options:\n" + "\t-c STR\tdatabase connection string\n" + "\t-v\tverbose mode: print extra information while processing\n"); +return 1; +} +if (verbose) { +cout << "Collecting information about prepared transactions...\n"; +} +for (vector::iterator ic = connections.begin(); ic != connections.end(); ++ic) +{ +if (verbose) { +cout << "Connecting to " << *ic << "...\n"; +} +connection con(*ic); +work txn(con); +result r = txn.exec("select gid from pg_prepared_xacts"); +for (result::const_iterator it = r.begin(); it != r.end(); ++it) +{ +string gid = it.at("gid").as(string()); +prepared_xacts.insert(gid); +} +txn.commit(); +} +if (verbose) { +cout << "Prepared transactions: "; +for (set::iterator it = prepared_xacts.begin(); it != prepared_xacts.end(); ++it) +{ +cout << *it << ", "; +} +
[HACKERS] Clock with Adaptive Replacement
Hi hackers, What do you think about improving the cache replacement clock-sweep algorithm in PostgreSQL with the adaptive version proposed in this article: http://www-cs.stanford.edu/~sbansal/pubs/fast04.pdf Are there some well-known drawbacks of this approach, or would it be interesting to adapt this algorithm to PostgreSQL and measure its impact on performance under different workloads? I found this ten-year-old thread: http://www.postgresql.org/message-id/flat/d2jkde$6bg$1...@sea.gmane.org#d2jkde$6bg$1...@sea.gmane.org but it mostly discusses possible patent issues with another algorithm, ARC (CAR is inspired by ARC, but it is a different algorithm). As far as I know there are several problems with the current clock-sweep algorithm in PostgreSQL, especially for very large caches. Maybe CAR can address some of them? -- Konstantin Knizhnik Postgres Professional: http://www.postgrespro.com The Russian Postgres Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Clock with Adaptive Replacement
Thank you very much for the response. I am not sure that CART can significantly improve PostgreSQL performance - I just want to know the opinion of the community about CAR/CART and other possible alternatives to the GCLOCK algorithm. It looks like CAR really provides a better cache hit ratio and so should increase Postgres performance on some workloads. But nowadays the amount of memory in servers is large enough to keep most typical databases completely in cache, so the time to locate a buffer in the cache is more critical than the time of buffer eviction. And here CART doesn't provide any benefits compared with the GCLOCK algorithm. One of the problems with the GCLOCK algorithm, from my point of view, is that for large caches, containing a large number of pages, locating a victim page can take a substantial amount of time, because we may have to perform several sweeps before some usage count reaches zero. In theory CAR can address this problem because there are no counters - just a single bit per page. On 12.02.2016 18:55, Robert Haas wrote: On Thu, Feb 11, 2016 at 4:02 PM, Konstantin Knizhnik wrote: What do you think about improving the cache replacement clock-sweep algorithm in PostgreSQL with the adaptive version proposed in this article: http://www-cs.stanford.edu/~sbansal/pubs/fast04.pdf Are there some well-known drawbacks of this approach, or would it be interesting to adapt this algorithm to PostgreSQL and measure its impact on performance under different workloads? I found this ten-year-old thread: http://www.postgresql.org/message-id/flat/d2jkde$6bg$1...@sea.gmane.org#d2jkde$6bg$1...@sea.gmane.org but it mostly discusses possible patent issues with another algorithm, ARC (CAR is inspired by ARC, but it is a different algorithm). As far as I know there are several problems with the current clock-sweep algorithm in PostgreSQL, especially for very large caches. Maybe CAR can address some of them? Maybe, but the proof of the pudding is in the eating. Just because an algorithm is smarter, newer, and better in general than our current algorithm - and really, it wouldn't be hard - doesn't mean that it will actually solve the problems we care about. A few of my EnterpriseDB colleagues spent a lot of time benchmarking various tweaks to our current algorithm last year and were unable to construct a test case where it sped anything up. If they tried the same tweaks against the 9.4 source base, they could get a speedup. But 9.5 had locking improvements around buffer eviction, and with those improvements committed there was no longer any measurable benefit to improving the quality of buffer eviction decisions. That's a surprising result, to me anyway, and somebody else might well find a test case where a benefit can be shown - but our research was not successful. I think it's important to spend time and energy figuring out exactly what the problems with our current algorithm are. We know in general terms that usage counts tend to converge to either 5 or 0 and therefore sometimes evict buffers both at great cost and almost randomly. But what's a lot less clear is how much that actually hurts us given that we are relying on the OS cache anyway. It may be that we need to fix some other things before or after improving the buffer eviction algorithm before we actually get a performance benefit. I suspect, for example, that a lot of the problems with large shared_buffers settings have to do with the bgwriter and checkpointer behavior rather than with the buffer eviction algorithm; and that others have to do with cache duplication between PostgreSQL and the operating system. 
So, I would suggest (although of course it's up to you) that you might want to focus on experiments that will help you understand where the problems are before you plunge into writing code to fix them. -- Konstantin Knizhnik Postgres Professional: http://www.postgrespro.com The Russian Postgres Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] pglogical - logical replication contrib module
s some rules like "don't create FKs pointing from non-replicated tables to tables replicated from somewhere else". A concept we currently have no way to express or enforce like we do persistent-to-UNLOGGED FKs. Then there's global objects. Something as simple as: CREATE ROLE fred; CREATE TABLE blah(...) OWNER fred; will break replication because we only see the CREATE TABLE, not the CREATE ROLE. If we instead replayed the CREATE ROLE and there were multiple connections between different DBs on an upstream and downstream apply would fail on all but one. But we can't anyway since there's no way to capture that CREATE ROLE from any DB except the one it was executed in, which might not even be one of the ones doing replication. I strongly suspect we'll need logical decoding to be made aware of such global DDL and decode it from the WAL writes to the system catalogs. Which will be fun - but at least modifications to the shared catalogs are a lot simpler than the sort of gymnastics done by ALTER TABLE, etc. -- Craig Ringer http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services -- Konstantin Knizhnik Postgres Professional: http://www.postgrespro.com The Russian Postgres Company
Re: [HACKERS] pglogical - logical replication contrib module
Ok, what about the following plan: 1. Support custom WAL records (as far as I know 2ndQuadrant has such a patch). 2. Add one more function to logical decoding, allowing custom records to be handled. So the idea is that we somehow record DDL in the WAL (for example using an executor hook), and then these records are processed by logical decoding, calling a special logical decoding plugin function to handle them. For example we can store DDL in the WAL just as SQL statements and so easily replay them. In this case DDL will be replicated using the same mechanism and through the same channel as DML. On 17.02.2016 12:16, Craig Ringer wrote: On 17 February 2016 at 16:24, Konstantin Knizhnik wrote: Thanks for your explanation. I have to agree with your arguments that in the general case replication of DDL statements using logical decoding seems to be problematic. But we are mostly considering logical decoding in a quite limited context: replication between two identical Postgres database nodes (multimaster). Yep, much like BDR. Where all this infrastructure came from and is/was aimed at. Do you think that in this case replication of DDL can be done as a sequence of low-level operations with system catalog tables, including manipulation with locks? No. For one thing logical decoding doesn't see catalog tuple changes right now. Though I imagine that could be changed easily enough. More importantly - oids. You add a column to a table: ALTER TABLE mytable ADD COLUMN mycolumn some_type UNIQUE NOT NULL DEFAULT some_function() This writes to catalogs including: pg_attribute pg_constraint pg_index pg_class (for the index relation) ... probably more. It also refers to pg_class (for the definition of mytable), pg_type (definition of some_type), pg_proc (definition of some_function), the b-tree operator class for some_type in pg_opclass, the b-tree indexam in pg_am, ... more. Everything is linked by oids, and the oids are all node local. You can't just blindly re-use them. If "some_type" is hstore, the oid of hstore in pg_type might be different on the upstream and downstream. The only exception is the oids of built-in types and even then that's not guaranteed across major versions. So if you blindly replicate catalog row changes you'll get a horrible mess. That's before considering a table's relfilenode, which is initially the same as its oid, but subject to change if truncated or rewritten. To even begin to do this half-sanely you'd have to maintain a mapping of upstream object oids->names on the downstream, with invalidations replicated from the upstream. That's only the beginning. There's handling of extensions and lots more fun. So in your example with the ALTER TABLE statement, can we correctly replicate it to other nodes as a request to set an exclusive lock plus some manipulations with catalog tables and the data table itself? Nope. No hope, not unless "some manipulations with catalog tables and the data table itself" is a lot more comprehensive than I think you mean. 1. Add an option whether to include operations on system catalog tables in logical replication or not. I would like to have this anyway. 2. Make it possible to replicate lock requests (can be useful not only for DDL). I have no idea how you'd even begin to do that. I looked at how DDL is implemented in BDR and did it in a similar way in our multimaster. But it is awful: we need to have two different channels for propagating changes. Yeah, it's not beautiful, but maybe you misunderstood something? 
The DDL is written to a table, and that table's changes are replayed along with everything else. It's consistent and keeps DDL changes as part of the xact that performed them. Maybe you misunderstood how it works in BDR and missed the indirection via a table? Additionally, in multimaster we want to enforce cluster-wide ACID. It certainly includes operations on metadata. It will be very difficult to implement if replication of DML and DDL is done in two different ways... That's pretty much why BDR does it this way, warts and all. Though it doesn't offer cluster-wide ACID it does need atomic commit of xacts that may contain DML, DDL, or some mix of the two. Let me ask one more question concerning logical replication: how difficult will it be, from your point of view, to support two-phase commit in logical replication? Are there any fundamental problems? I haven't looked closely yet. Andres will know more. I very, very badly want to be able to decode 2PC prepared xacts myself. The main issue I'm aware of is locking - specifically the inability to impersonate another backend and treat locks held by that backend (which might be a fake backend for a pg_prepared_xacts entry)
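To make the table-indirection approach concrete, here is a minimal sketch of the idea, assuming a hypothetical ddl_queue table and capture function (names are illustrative and this is a simplification of what BDR/pglogical actually do): an event trigger records the statement text into an ordinary table, so the DDL travels through logical decoding inside the same transaction as the DML, and the apply side simply executes the queued text in commit order.

-- Hypothetical DDL queue; replicated like any other table.
CREATE TABLE ddl_queue (
    id        bigserial PRIMARY KEY,
    queued_at timestamptz NOT NULL DEFAULT now(),
    ddl_text  text NOT NULL
);

CREATE OR REPLACE FUNCTION capture_ddl() RETURNS event_trigger AS $$
BEGIN
    -- current_query() is a simplification: it captures the whole submitted
    -- statement text, which is good enough for a sketch. A real
    -- implementation must also avoid re-capturing DDL replayed from the
    -- queue on the receiving side.
    INSERT INTO ddl_queue (ddl_text) VALUES (current_query());
END;
$$ LANGUAGE plpgsql;

CREATE EVENT TRIGGER capture_ddl_on_command_end
    ON ddl_command_end
    EXECUTE PROCEDURE capture_ddl();

-- The apply side replays each ddl_queue row with EXECUTE ddl_text, in the
-- same order and transaction boundaries in which the rows arrive through
-- logical decoding.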
Re: [HACKERS] The plan for FDW-based sharding
Once that is done, we can see what workloads it covers and decide if we are willing to copy the volume of code necessary to implement all supported Postgres XC or XL workloads. (The Postgres XL license now matches the Postgres license, http://www.postgres-xl.org/2015/07/license-change-and-9-5-merge/. Postgres XC has always used the Postgres license.) If we are not willing to add code for the missing Postgres XC/XL features, Postgres XC/XL will probably remain a separate fork of Postgres. I don't think anyone knows the answer to this question, and I don't know how to find the answer except to keep going with our current FDW sharding approach. -- Konstantin Knizhnik Postgres Professional: http://www.postgrespro.com The Russian Postgres Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Relation cache invalidation on replica
The reason for the problem is that invalidation messages are not delivered to the replica after the end of a concurrent index build. Invalidation messages are included in the xlog as part of the transaction commit record. A concurrent index build is split into three transactions, the last of which just performs an in-place update of the index tuple, marking it as valid and invalidating the cache. But since this transaction is not assigned an XID, no commit record is written to WAL and sent to the replicas. As a result, the replica doesn't receive these invalidation messages. To fix the problem it is enough to assign an XID to the transaction. It can be done by adding a GetCurrentTransactionId() call to the end of the DefineIndex function:

diff --git a/src/backend/commands/indexcmds.c b/src/backend/commands/indexcmds.c
index 13b04e6..1024603 100644
--- a/src/backend/commands/indexcmds.c
+++ b/src/backend/commands/indexcmds.c
@@ -881,6 +881,12 @@ DefineIndex(Oid relationId,
 	CacheInvalidateRelcacheByRelid(heaprelid.relId);
 
 	/*
+	 * Force WAL commit record to ensure that replica receives invalidation
+	 * messages.
+	 */
+	GetCurrentTransactionId();
+
+	/*
 	 * Last thing to do is release the session-level lock on the parent table.
 	 */
 	UnlockRelationIdForSession(&heaprelid, ShareUpdateExclusiveLock);

On 26.02.2016 15:41, Васильев Дмитрий wrote: A session opened on the replica doesn't see indexes concurrently created on the master at this time. We have master and replica: 1. master: pgbench -i -s 10 2. replica: explain (analyze,verbose) select * from pgbench_accounts where abalance = 1; 3. master: ALTER INDEX pgbench_accounts_abalance_idx RENAME TO pgbench_accounts_abalance_idx_delme; CREATE INDEX CONCURRENTLY pgbench_accounts_abalance_idx ON pgbench_accounts USING btree (abalance); DROP INDEX pgbench_accounts_abalance_idx_delme; 4. at this time on replica: explain (analyze,verbose) select * from pgbench_accounts where abalance = 1;

pgbench=# explain (analyze,verbose) select * from pgbench_accounts where abalance = 1;
                                   QUERY PLAN
--------------------------------------------------------------------------------
 Index Scan using pgbench_accounts_abalance_idx on public.pgbench_accounts  (cost=0.42..4.44 rows=1 width=97) (actual time=655.781..655.781 rows=0 loops=1)
   Output: aid, bid, abalance, filler
   Index Cond: (pgbench_accounts.abalance = 1)
 Planning time: 9388.259 ms
 Execution time: 655.900 ms
(5 rows)

pgbench=# explain (analyze,verbose) select * from pgbench_accounts where abalance = 1;
                                   QUERY PLAN
--------------------------------------------------------------------------------
 Index Scan using pgbench_accounts_abalance_idx_delme on public.pgbench_accounts  (cost=0.42..4.44 rows=1 width=97) (actual time=0.014..0.014 rows=0 loops=1)
   Output: aid, bid, abalance, filler
   Index Cond: (pgbench_accounts.abalance = 1)
 Planning time: 0.321 ms
 Execution time: 0.049 ms
(5 rows)

pgbench=# explain (analyze,verbose) select * from pgbench_accounts where abalance = 1;
                                   QUERY PLAN
--------------------------------------------------------------------------------
 Seq Scan on public.pgbench_accounts  (cost=0.00..28894.00 rows=1 width=97) (actual time=3060.451..3060.451 rows=0 loops=1)
   Output: aid, bid, abalance, filler
   Filter: (pgbench_accounts.abalance = 1)
   Rows Removed by Filter: 100
 Planning time: 0.087 ms
 Execution time: 3060.484 ms
(6 rows)

pgbench=# \d+ pgbench_accounts
                      Table "public.pgbench_accounts"
  Column  |     Type      | Modifiers | Storage  | Stats target | Description
----------+---------------+-----------+----------+--------------+-------------
 aid      | integer       | not null  | plain    |              |
 bid      | integer       |           | plain    |              |
 abalance | integer       |           | plain    |              |
 filler   | character(84) |           | extended |              |
Indexes:
    "pgbench_accounts_pkey" PRIMARY KEY, btree (aid)
    "pgbench_accounts_abalance_idx" btree (abalance)
Options: fillfactor=100

A newly opened session successfully uses this index.
Tested with PostgreSQL 9.5.1. --- Dmitry Vasilyev Postgres Professional: http://www.postgrespro.com Russian Postgres Company -- Konstantin Knizhnik Postgres Professional: http://www.postgrespro.com The Russian Postgres Company
Re: [HACKERS] The plan for FDW-based sharding
We do not have a formal proof that the proposed XTM is "general enough" to handle all possible transaction manager implementations. But there are two general ways of dealing with isolation: snapshot-based and CSN-based. pg_dtm and pg_tsdtm prove that both of them can be implemented using XTM. If you know of some other approach to distributed transaction manager implementation, please let us know. Otherwise the statement "is not general enough" is not concrete enough. The Postgres-XL GTM can in principle be implemented as an extension based on XTM. This API is based on existing PostgreSQL TM functions: we do not introduce any new abstractions. Is it possible that some other TM function has to be encapsulated? Yes, it is. But I do not see much of a problem with adding such a function to XTM in the future if it is actually needed. It happens with most APIs. It is awful when API functions are changed, breaking applications based on this API. But since the functions encapsulated in XTM are in any case present in the PostgreSQL core, I do not think they will be changed in the future unless there are plans to completely rewrite the Postgres transaction manager... Yes, it is certainly possible to develop a cluster by cloning PostgreSQL. But it causes big problems both for developers, who have to constantly synchronize their branch with master, and, more importantly, for customers, who cannot use the standard version of PostgreSQL. It may cause problems with system certification, with running Postgres in the cloud, ... Actually the history of Postgres-XL/XC and Greenplum IMHO shows that this is the wrong direction. On 26.02.2016 19:06, Robert Haas wrote: On Fri, Feb 26, 2016 at 7:21 PM, Oleg Bartunov wrote: Right now the tm is hardcoded and it doesn't matter "if other people might need" at all. We at least provide developers ("other people") the ability to work on their implementations, and the patch is safe and doesn't sacrifice anything in core. I don't believe that. When we install APIs into core, we're committing to keep those APIs around. And I think that we're far too early in the development of transaction managers for PostgreSQL to think that we know what APIs we want to commit to over the long term. And what makes us think we really need multiple transaction managers, anyway? If you are brave enough to say that one tm fits all and you are able to teach the existing tm to play well in various clustering environments during the development period, which is short, then probably we don't need multiple tms. But it's too perfect to believe, and the practical solution is to let multiple groups work on their solutions. Nobody's preventing multiple groups from working on their solutions. That's not the question. The question is why we should install hooks in core at this early stage without waiting to see which implementations prove to be best and whether those hooks are actually general enough to cater to everything people want to do. There is talk of integrating XC/XL work into PostgreSQL; it has a GTM. Postgres Pro has several GTMs. Maybe there will be others. Frankly, I'd like to see a GTM in core at some point because I'd like everybody who uses PostgreSQL to have access to a GTM. What I don't want is for every PostgreSQL company to develop its own GTM and distribute it separately from everybody else's. IIUC, MySQL kinda did that with storage engines and it resulted in the fragmentation of the community. We've had the same thing happen with replication tools - every PostgreSQL company develops their own set.
It would have been better to have ONE set that was distributed by the core project so that we didn't all do the same work over again. I don't understand the argument that without these hooks in core, people can't continue to work on this. It isn't hard to work on GTM without any core changes at all. You just patch your copy of PostgreSQL. We do this all the time, for every patch. We don't add hooks for every patch. dtms. It's time to start working on dtm, I believe. The fact you don't think about distributed transactions support doesn't mean there no "other people", who has different ideas on postgres future. That's why we propose this patch, let's play the game ! I don't like to play games with the architecture of PostgreSQL. -- Konstantin Knizhnik Postgres Professional: http://www.postgrespro.com The Russian Postgres Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] The plan for FDW-based sharding
On 02/26/2016 09:30 PM, Alvaro Herrera wrote: Konstantin Knizhnik wrote: Yes, it is certainly possible to develop a cluster by cloning PostgreSQL. But it causes big problems both for developers, who have to constantly synchronize their branch with master, and, more importantly, for customers, who cannot use the standard version of PostgreSQL. It may cause problems with system certification, with running Postgres in the cloud, ... Actually the history of Postgres-XL/XC and Greenplum IMHO shows that this is the wrong direction. That's not the point, though. I don't think a Postgres clone with a GTM solves any particular problem that's not already solved by the existing forks. However, if you have a clone at home and you make a GTM work on it, then you take the GTM as a patch and post it for discussion. There's no need for hooks for that. Just make sure your GTM solves the problem that it is supposed to solve. Excuse me if I've missed the discussion elsewhere -- why does PostgresPro have *two* GTMs instead of a single one? There are many different clusters which require different approaches to managing distributed transactions. Some clusters do not need distributed transactions at all: if you are executing OLAP queries on a read-only database, a GTM will just add extra overhead. pg_dtm uses a centralized arbiter. It is similar to the Postgres-XL DTM. The presence of a single arbiter significantly simplifies all distributed algorithms: failure detection, global deadlock elimination, ... But at the same time the arbiter is a SPOF and the main factor limiting cluster scalability. pg_tsdtm is based on another approach: it uses system time as the CSN and doesn't require an arbiter. In theory there is no limit to scalability. But differences in system time and the need for more rounds of communication have a negative impact on performance. So there is no ideal solution which works well for all clusters. This is why it is not possible to develop just one GTM, propose it as a patch for review and then (hopefully) commit it to the Postgres core. IMHO it will never happen. And I do not think that it is actually needed. What we need is a way to create our own transaction managers as Postgres extensions without affecting the core. All arguments against XTM can be applied to any other extension API in Postgres, for example FDW. Is it general enough? There are many useful operations which are currently not handled by this API, for example performing aggregation and grouping on the foreign server side. But it is still a very useful and flexible mechanism which allows implementing many wonderful things. From my point of view a good system should be as open and customizable as possible, as long as it doesn't affect performance. Replacing direct function calls with indirect function calls, like adding hooks, in almost all cases does not hurt performance. So without any extra cost we get better flexibility. What's wrong with that? -- Konstantin Knizhnik Postgres Professional: http://www.postgrespro.com The Russian Postgres Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
[HACKERS] Logical decoding restart problems
Hi, We are using logical decoding in multimaster and we are faced with a problem where inconsistent transactions are sent to the replica. Briefly, multimaster uses logical decoding in this way: 1. Each multimaster node is connected with every other node through a logical decoding channel, so each pair of nodes has its own replication slot. 2. In the normal scenario each replication channel is used to replicate only those transactions which originated at the source node. We use the origin mechanism to skip "foreign" transactions. 3. When an offline cluster node is brought back into the multimaster we need to recover this node to the current cluster state. Recovery is performed from one of the cluster's nodes, so we use only one replication channel to receive all (own and foreign) transactions. Only in this case can we guarantee a consistent order of applying transactions at the recovered node. After the end of recovery we need to recreate the replication slots with all other cluster nodes (because we have already replayed the transactions from these nodes). To restart logical decoding we first drop the existing slot, then create a new one, and then start logical replication from WAL position 0/0 (invalid LSN). In this case replay should start from the last consistent point. The problem is that for some reason the consistent point is not so consistent and we get partly decoded transactions. I.e. the transaction body consists of two UPDATEs, but the reorder buffer extracts only one (the last) update and sends this truncated transaction to the destination, causing a consistency violation at the replica. I started investigating the logical decoding code and found several things which I do not understand. Assume that we have transactions T1={start_lsn=100, end_lsn=400} and T2={start_lsn=200, end_lsn=300}. Transaction T2 is sent to the replica and the replica confirms that flush_lsn=300. If we now want to restart logical decoding, we cannot start at a position less than 300, because CreateDecodingContext doesn't allow it:

 * start_lsn
 *		The LSN at which to start decoding.  If InvalidXLogRecPtr, restart
 *		from the slot's confirmed_flush; otherwise, start from the specified
 *		location (but move it forwards to confirmed_flush if it's older than
 *		that, see below).

	else if (start_lsn < slot->data.confirmed_flush)
	{
		/*
		 * It might seem like we should error out in this case, but it's
		 * pretty common for a client to acknowledge a LSN it doesn't have to
		 * do anything for, and thus didn't store persistently, because the
		 * xlog records didn't result in anything relevant for logical
		 * decoding. Clients have to be able to do that to support synchronous
		 * replication.
		 */

So does it mean that we have no chance to restore T1? What is worse, if there are valid T1 transaction records with lsn >= 300, then we can partly decode T1 and send this T1' to the replica. Did I miss something here? Is there any alternative way to "seek" the slot to the proper position without actually fetching data from it or recreating the slot? Is there any mechanism in the xlog which can enforce consistent decoding of a transaction (so that no transaction records are missed)? Maybe I missed something, but I didn't find any "record_number" or anything else which can identify the first record of a transaction. Thanks in advance, Konstantin Knizhnik, Postgres Professional: http://www.postgrespro.com Russian Postgres Company
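As a side note for anyone reproducing this, the slot's two positions can be inspected directly, and pending changes can be peeked at without consuming them. This is only a hedged illustration; the slot name is taken from the follow-up message and may differ in your setup:

-- On the sending node: compare the two positions the discussion is about.
SELECT slot_name, restart_lsn, confirmed_flush_lsn
  FROM pg_replication_slots
 WHERE slot_name = 'mtm_slot_1';

-- Peek at what the slot would decode next without advancing it, to see
-- whether a transaction comes out whole or truncated (only works for an
-- inactive slot, from the database it was created in):
SELECT lsn, xid, data
  FROM pg_logical_slot_peek_changes('mtm_slot_1', NULL, NULL);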
Re: [HACKERS] Logical decoding restart problems
CREATE_REPLICATION_SLOT "mtm_slot_1" LOGICAL "multimaster";
START_REPLICATION SLOT "mtm_slot_1" LOGICAL 0/0 ("startup_params_format" '1', "max_proto_version" '1', "min_proto_version" '1', "forward_changesets" '1', "mtm_replication_mode" 'recovered');
I have also tried to calculate the last origin LSN for this node and explicitly specify it in START_REPLICATION. But it doesn't help: the same problem persists. -- Konstantin Knizhnik Postgres Professional: http://www.postgrespro.com The Russian Postgres Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Logical decoding restart problems
Thank you for answers. > No, you don't need to recreate them. Just advance your replication identifier > downstream and request a replay position in the future. Let the existing slot > skip over unwanted data and resume where you want to start replay. > > You can advance the replication origins on the peers as you replay forwarded > xacts from your master. > > Have a look at how the BDR code does this during "catchup mode" replay. > > So while your problem discussed below seems concerning, you don't have to > drop and recreate slots like are currently doing. The only reason for recreation of slot is that I want to move it to the current "horizont" and skip all pending transaction without explicit specification of the restart position. If I do not drop the slot and just restart replication specifying position 0/0 (invalid LSN), then replication will be continued from the current slot position in WAL, will not it? So there is no way to specify something "start replication from the end of WAL", like lseek(0, SEEK_END). Right now I trying to overcome this limitation by explicit calculation of the position from which we should continue replication. But unfortunately the problem with partly decoded transactions persist. I will try at next week to create example reproducing the problem without any multimaster stuff, just using standard logical decoding plugin. > > To restart logical decoding we first drop existed slot, then create new one > and then start logical replication from the WAL position 0/0 (invalid LSN). > In this case recovery should be started from the last consistent point. > > How do you create the new slot? SQL interface? walsender interface? Direct C > calls? Slot is created by peer node using standard libpq connection with database=replication connection string. > > The problem is that for some reasons consistent point is not so consistent > and we get partly decoded transactions. > I.e. transaction body consists of two UPDATE but reorder_buffer extracts only > the one (last) update and sent this truncated transaction to destination > causing consistency violation at replica. I started investigation of logical > decoding code and found several things which I do not understand. > > Yeah, that sounds concerning and shouldn't happen. I looked at replication code more precisely and understand that my first concerns were wrong. Confirming flush position should not prevent replaying transactions with smaller LSNs. But unfortunately the problem is really present. May be it is caused by race conditions (although most logical decoder data is local to backend). This is why I will try to create reproducing scenario without multimaster. > > Assume that we have transactions T1={start_lsn=100, end_lsn=400} and > T2={start_lsn=200, end_lsn=300}. > Transaction T2 is sent to the replica and replica confirms that flush_lsn=300. > If now we want to restart logical decoding, we can not start with position > less than 300, because CreateDecodingContext doesn't allow it: > > > Right. You've already confirmed receipt of T2, so you can't receive it again. > > So it means that we have no chance to restore T1? > > Wrong. You can, because the slot's restart_lsn still be will be some LSN <= > 100. The slot keeps track of inprogress transactions (using xl_running_xacts > records) and knows it can't discard WAL past lsn 100 because xact T1 is still > in-progress, so it must be able to decode from the start of it. 
> > When you create a decoding context decoding starts at restart_lsn not at > confirmed_flush_lsn. confirmed_flush_lsn is the limit at which commits start > resulting in decoded data being sent to you. So in your case, T1 commits at > lsn=400, which is >300, so you'll receive the whole xact for T1. Yeh, but unfortunately it happens. Need to understand why... > > It's all already there. See logical decoding's use of xl_running_xacts. But how this information is persisted? What will happen if wal_sender is restarted? > > -- > Craig Ringer http://www.2ndQuadrant.com/ > PostgreSQL Development, 24x7 Support, Training & Services
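To make Craig's suggestion concrete, here is a hedged sketch of the alternative to dropping the slot: keep the slot, advance the replication origin on the applying side so already-forwarded transactions are skipped there, and simply ask the walsender to resume, optionally passing a future start position. The origin name and LSN values below are illustrative:

-- On the downstream (applying) node: move the origin forward so xacts
-- already received through the recovery channel are not applied twice.
SELECT pg_replication_origin_advance('mtm_origin_1', '0/1A2B3C40');

-- On the walsender connection: resume from the existing slot; the start
-- position only tells the sender not to ship transactions that commit
-- before that point, and the slot itself is left intact.
START_REPLICATION SLOT "mtm_slot_1" LOGICAL 0/1A2B3C40;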
Re: [HACKERS] UPSERT strange behavior
On 08/25/2016 10:08 PM, Peter Geoghegan wrote: On Thu, Aug 25, 2016 at 11:49 AM, Tom Lane wrote: I think the point is that given the way he's set up the test case, there should be no duplicate violation in the plain unique index unless there is one in the arbiter index. So assuming that INSERT tests the arbiter indexes first, there shouldn't be an error. Maybe it doesn't do that, but it seems like it would be a good idea if it did. Oh, yeah. This is arguably an example of inference failing to infer multiple unique indexes as arbiters. Inference could, in principle, recognize that the second unique index is equivalent to the first, but doesn't. (I don't think that it matters which order anything is tested in, though, because not finding a dup value in the arbiter index does not guarantee that there won't be one in the other index. There is no locking when no conflict is initially found, and so no guarantees here.) Anyway, I don't have a lot of sympathy for this point of view, because the scenario is completely contrived. You have to draw the line somewhere. I do not think that this scenario is completely contrived: cases where a table has more than one unique key are quite common. For example, a "user" may have a unique e-mail address, phone number and login. Also, as far as I know, this is not an artificial example, but a real case taken from a customer's application... I am not sure whether it's really a bug or a feature, but the user's intention was obvious: locate the record by one of the unique keys and, if such a record already exists, increment a counter (do an update instead of an insert). But there are also good arguments why upsert may report a conflict in this case... If such UPSERT behavior is assumed to be correct, what is the best workaround for the problem if we really need to have two separate indexes and want to enforce a unique constraint for both keys? -- Konstantin Knizhnik Postgres Professional: http://www.postgrespro.com The Russian Postgres Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
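A minimal sketch of the situation being discussed (table and column names are purely illustrative): only one unique index can be inferred as the ON CONFLICT arbiter, so a conflict that materializes on the other unique index surfaces as a plain unique_violation instead of taking the DO UPDATE path. One common, if inelegant, workaround is to retry on unique_violation; it is only sensible when conflicts on the two keys are expected to coincide, as in the customer's scenario.

-- Illustrative schema: two independent unique keys, only one arbiter.
CREATE TABLE users (
    login text UNIQUE,
    email text UNIQUE,
    hits  int NOT NULL DEFAULT 1
);

INSERT INTO users (login, email) VALUES ('knizhnik', 'k@example.com');

-- Arbitrates only on login; the duplicate email raises unique_violation
-- instead of turning into the UPDATE:
INSERT INTO users (login, email) VALUES ('teodor', 'k@example.com')
ON CONFLICT (login) DO UPDATE SET hits = users.hits + 1;

-- Hedged workaround sketch: retry the statement on unique_violation.
DO $$
BEGIN
    LOOP
        BEGIN
            INSERT INTO users (login, email) VALUES ('knizhnik', 'k@example.com')
            ON CONFLICT (login) DO UPDATE SET hits = users.hits + 1;
            EXIT;
        EXCEPTION WHEN unique_violation THEN
            NULL;  -- lost a race on the other key; loop and retry
        END;
    END LOOP;
END $$;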
[HACKERS] Handling dropped attributes in pglogical_proto
Hi, pglogical_read_tuple from pglogical_proto.c contains the following code:

	natts = pq_getmsgint(in, 2);
	if (rel->natts != natts)
		elog(ERROR, "tuple natts mismatch, %u vs %u", rel->natts, natts);

But if the table was just altered and some attribute was removed from it, then rel->natts can be greater than natts. The problem can be fixed by the following patch:

--- a/pglogical_proto.c
+++ b/pglogical_proto.c
@@ -263,10 +263,15 @@ pglogical_read_tuple(StringInfo in, PGLogicalRelation *rel,
 	{
 		int			attid = rel->attmap[i];
 		Form_pg_attribute att = desc->attrs[attid];
-		char		kind = pq_getmsgbyte(in);
+		char		kind;
 		const char *data;
 		int			len;
 
+		if (att->atttypid == InvalidOid) {
+			continue;
+		}
+		kind = pq_getmsgbyte(in);
+
 		switch (kind)
 		{
 			case 'n':		/* null */

-- Konstantin Knizhnik Postgres Professional: http://www.postgrespro.com The Russian Postgres Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
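For context on what the patch keys on: after ALTER TABLE ... DROP COLUMN the pg_attribute row for the column is kept, with attisdropped set and atttypid reset to 0 (InvalidOid). A small illustration, using a throwaway table name:

CREATE TABLE t (a int, b int);
ALTER TABLE t DROP COLUMN b;

-- The dropped column is still present in pg_attribute, marked dropped
-- and with atttypid = 0, which is the condition the patch tests for:
SELECT attnum, attname, atttypid, attisdropped
  FROM pg_attribute
 WHERE attrelid = 't'::regclass AND attnum > 0;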
[HACKERS] Issue with logical replication: MyPgXact->xmin already is valid
When creating logical replication slots we quite often get the following error: ERROR: cannot build an initial slot snapshot when MyPgXact->xmin already is valid which cause restart of WAL sender. The comment to this line doesn't clarify much: /* so we don't overwrite the existing value */ if (TransactionIdIsValid(MyPgXact->xmin)) elog(ERROR, "cannot build an initial slot snapshot when MyPgXact->xmin already is valid"); I have checked that MyPgXact->xmin is set by GetSnapshotData: #3 0x0086025e in GetSnapshotData (snapshot=0xf28040 ) at procarray.c:1714 #4 0x00a48523 in GetNonHistoricCatalogSnapshot (relid=2615) at snapmgr.c:479 #5 0x00a484d3 in GetCatalogSnapshot (relid=2615) at snapmgr.c:452 #6 0x004f15bf in systable_beginscan (heapRelation=0x256bdb0, indexId=2684, indexOK=1 '\001', snapshot=0x0, nkeys=1, key=0x7ffdf633c770) at genam.c:353 #7 0x009d676e in SearchCatCache (cache=0x25230a8, v1=39586984, v2=0, v3=0, v4=0) at catcache.c:1169 #8 0x009ec13d in SearchSysCache (cacheId=35, key1=39586984, key2=0, key3=0, key4=0) at syscache.c:1109 #9 0x009ec259 in GetSysCacheOid (cacheId=35, key1=39586984, key2=0, key3=0, key4=0) at syscache.c:1187 #10 0x00574b85 in get_namespace_oid (nspname=0x25c0ca8 "pg_catalog", missing_ok=1 '\001') at namespace.c:3009 #11 0x00574886 in LookupExplicitNamespace (nspname=0x25c0ca8 "pg_catalog", missing_ok=1 '\001') at namespace.c:2871 #12 0x0057437d in get_ts_config_oid (names=0x25c2438, missing_ok=1 '\001') at namespace.c:2653 #13 0x009f56ca in check_TSCurrentConfig (newval=0x7ffdf633cb90, extra=0x7ffdf633cba0, source=PGC_S_FILE) at ts_cache.c:600 #14 0x00a1bbdc in call_string_check_hook (conf=0xf26870 , newval=0x7ffdf633cb90, extra=0x7ffdf633cba0, source=PGC_S_FILE, elevel=12) at guc.c:9912 #15 0x00a14420 in parse_and_validate_value (record=0xf26870 , name=0x25c0620 "default_text_search_config", value=0x25c0658 "pg_catalog.english", source=PGC_S_FILE, elevel=12, newval=0x7ffdf633cb90, newextra=0x7ffdf633cba0) at guc.c:5840 #16 0x00a15543 in set_config_option (name=0x25c0620 "default_text_search_config", value=0x25c0658 "pg_catalog.english", context=PGC_SIGHUP, source=PGC_S_FILE, action=GUC_ACTION_SET, changeVal=1 '\001', elevel=12, is_reload=0 '\000') at guc.c:6442 #17 0x00a1eb5c in ProcessConfigFileInternal (context=PGC_SIGHUP, applySettings=1 '\001', elevel=13) at guc-file.l:439 #18 0x00a1e4ee in ProcessConfigFile (context=PGC_SIGHUP) at guc-file.l:155 #19 0x0082433c in WalSndWaitForWal (loc=25991904) at walsender.c:1309 #20 0x00823423 in logical_read_xlog_page (state=0x25a4f10, targetPagePtr=25985024, reqLen=6880, targetRecPtr=25991880, cur_page=0x25a6c60 "\227\320\005", pageTLI=0x25a57bc) at walsender.c:761 #21 0x00558c3d in ReadPageInternal (state=0x25a4f10, pageptr=25985024, reqLen=6880) at xlogreader.c:556 #22 0x00558405 in XLogReadRecord (state=0x25a4f10, RecPtr=25991880, errormsg=0x7ffdf633cea8) at xlogreader.c:255 #23 0x0080dda6 in DecodingContextFindStartpoint (ctx=0x25a4c50) at logical.c:450 #24 0x00823a0c in CreateReplicationSlot (cmd=0x24dc398) at walsender.c:934 #25 0x008247e4 in exec_replication_command (cmd_string=0x254e8f0 "CREATE_REPLICATION_SLOT \"shardman_copy_t_10_3_4_17307_sync_17302\" TEMPORARY LOGICAL pgoutput USE_SNAPSHOT") at walsender.c:1515 #26 0x0088eccc in PostgresMain (argc=1, argv=0x24f0b28, dbname=0x24f0948 "postgres", username=0x24bf7f0 "knizhnik") at postgres.c:4086 #27 0x007ef9e2 in BackendRun (port=0x24dee00) at postmaster.c:4357 #28 0x007ef10c in BackendStartup (port=0x24dee00) at postmaster.c:4029 #29 0x007eb6cc in 
ServerLoop () at postmaster.c:1753 #30 0x007eacb8 in PostmasterMain (argc=3, argv=0x24bd660) at postmaster.c:1361 #31 0x00728593 in main (argc=3, argv=0x24bd660) at main.c:228 I wonder if it is normal situation or something goes wrong? -- Konstantin Knizhnik Postgres Professional: http://www.postgrespro.com The Russian Postgres Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Issue with logical replication: MyPgXact->xmin already is valid
On 06.10.2017 15:29, Petr Jelinek wrote: On 06/10/17 12:16, Konstantin Knizhnik wrote: When creating logical replication slots we quite often get the following error: ERROR: cannot build an initial slot snapshot when MyPgXact->xmin already is valid which cause restart of WAL sender. The comment to this line doesn't clarify much: /* so we don't overwrite the existing value */ if (TransactionIdIsValid(MyPgXact->xmin)) elog(ERROR, "cannot build an initial slot snapshot when MyPgXact->xmin already is valid"); I wonder if it is normal situation or something goes wrong? Hi, no, it's not a normal situation; it seems you are doing something that assigns an xid before you run the CREATE_REPLICATION_SLOT command on that connection. I am not doing anything in this connection: it is the walsender executing the CREATE_REPLICATION_SLOT replication command. Please look at the stack in my original e-mail. It shows who sets MyPgXact->xmin and when. It is GetSnapshotData, called because of reloading of the configuration settings. And the configuration settings are reloaded because our application is updating "synchronous_standby_names". -- Konstantin Knizhnik Postgres Professional: http://www.postgrespro.com The Russian Postgres Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Issue with logical replication: MyPgXact->xmin already is valid
On 10/07/2017 04:26 PM, Petr Jelinek wrote: Hmm so you start transaction (you have to when running CREATE_REPLICATION_SLOT with USE_SNAPSHOT parameter). And while the slot is being created the config is reloaded. And since now you are in transaction the tsearch hook for GUC processing tries to access catalogs which sets the xmin for the transaction. Actually this slot is implicitly created by LogicalRepSyncTableStart to perform initial data sync. That's not good, but I can't really say I have idea about what to do with it other than to set some kind of flag saying that logical decoding snapshot is being built and using that to skip catalog access which does not seem very pretty. It is not quite clear from the comment why it is not possible to overwrite MyPgXact->xmin or just use existed value. -- Konstantin Knizhnik Postgres Professional: http://www.postgrespro.com The Russian Postgres Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
[HACKERS] Slow synchronous logical replication
In our sharded cluster project we are trying to use logical replication to provide HA (maintaining redundant shard copies). Using asynchronous logical replication does not make much sense in the context of HA. This is why we try to use synchronous logical replication. Unfortunately it shows very bad performance. With 50 shards and redundancy level 1 (just one copy) the cluster is 20 times slower than without logical replication. With asynchronous replication it is "only" two times slower. As far as I understand, the reason for such bad performance is that the synchronous replication mechanism was originally developed for streaming replication, where all replicas have the same content and LSNs. When it is used for logical replication, it behaves very inefficiently. A commit has to wait for confirmations from all receivers mentioned in the "synchronous_standby_names" list. So we are waiting not only for our own single logical replication standby, but for all other standbys as well. The number of synchronous standbys is equal to the number of shards divided by the number of nodes. To provide uniform distribution the number of shards should be much greater than the number of nodes; for example, for 10 nodes we usually create 100 shards. As a result we get awful performance, and blocking of any replication channel blocks all backends. So my question is whether my understanding is correct and synchronous logical replication cannot be used efficiently in such a manner. If so, the next question is how difficult it would be to make the synchronous replication mechanism more efficient for logical replication, and whether there are plans to work in this direction? -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
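As a rough illustration of the setup just described (subscription names and their count are made up; each subscription's walsender reports the subscription name as its application_name, which is what synchronous_standby_names matches against):

-- On the publishing node: make commits wait for every listed subscriber.
ALTER SYSTEM SET synchronous_commit = on;
ALTER SYSTEM SET synchronous_standby_names =
    'FIRST 3 (sub_shard_1, sub_shard_2, sub_shard_3)';
SELECT pg_reload_conf();
-- With FIRST 3 and exactly three names, every commit waits for
-- confirmation from all three logical subscribers.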
Re: [HACKERS] Slow synchronous logical replication
On 10/07/2017 10:42 PM, Andres Freund wrote: Hi, On 2017-10-07 22:39:09 +0300, konstantin knizhnik wrote: In our sharded cluster project we are trying to use logical relication for providing HA (maintaining redundant shard copies). Using asynchronous logical replication has not so much sense in context of HA. This is why we try to use synchronous logical replication. Unfortunately it shows very bad performance. With 50 shards and level of redundancy=1 (just one copy) cluster is 20 times slower then without logical replication. With asynchronous replication it is "only" two times slower. As far as I understand, the reason of such bad performance is that synchronous replication mechanism was originally developed for streaming replication, when all replicas have the same content and LSNs. When it is used for logical replication, it behaves very inefficiently. Commit has to wait confirmations from all receivers mentioned in "synchronous_standby_names" list. So we are waiting not only for our own single logical replication standby, but all other standbys as well. Number of synchronous standbyes is equal to number of shards divided by number of nodes. To provide uniform distribution number of shards should >> than number of nodes, for example for 10 nodes we usually create 100 shards. As a result we get awful performance and blocking of any replication channel blocks all backends. So my question is whether my understanding is correct and synchronous logical replication can not be efficiently used in such manner. If so, the next question is how difficult it will be to make synchronous replication mechanism for logical replication more efficient and are there some plans to work in this direction? This seems to be a question that is a) about a commercial project we don't know much about b) hasn't received a lot of investigation. Sorry, If I was not clear. The question was about logical replication mechanism in mainstream version of Postgres. I think that most of people are using asynchronous logical replication and synchronous LR is something exotic and not well tested and investigated. It will be great if I am wrong:) Concerning our sharded cluster (pg_shardman) - it is not a commercial product yet, it is in development phase. We are going to open its sources when it will be more or less stable. But unlike multimaster, this sharded cluster is mostly built from existed components: pg_pathman + postgres_fdw + logical replication. So we are just trying to combine them all into some integrated system. But currently the most obscure point is logical replication. And the main goal of my e-mail was to know the opinion of authors and users of LR whether it is good idea to use LR to provide fault tolerance in sharded cluster. Or some other approaches, for example sharding with redundancy or using streaming replication are preferable? -- Konstantin Knizhnik Postgres Professional: http://www.postgrespro.com The Russian Postgres Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Slow synchronous logical replication
Thank you for the explanations. On 08.10.2017 16:00, Craig Ringer wrote: I think it'd be helpful if you provided reproduction instructions, test programs, etc, making it very clear when things are / aren't related to your changes. It will not be so easy to provide a reproducing scenario, because it involves many components (postgres_fdw, pg_pathman, pg_shardman, LR, ...) and requires a multi-node installation. But let me try to explain what is going on: So we have implemented sharding - splitting data between several remote tables using pg_pathman and postgres_fdw. It means that an insert or update of the parent table causes an insert or update of some derived partition, which is forwarded by postgres_fdw to the corresponding node. The number of shards is significantly larger than the number of nodes, i.e. for 5 nodes we have 50 shards, which means that on each node we have 10 shards. To provide fault tolerance each shard is replicated using logical replication to one or more nodes. Right now we consider only redundancy level 1 - each shard has only one replica. So from each node we establish 10 logical replication channels. We want commit to wait until the data is actually stored at all replicas, so we are using synchronous replication: we set the synchronous_commit option to "on" and include all ten subscriptions in the synchronous_standby_names list. In this setup the commit latency is very large (about 100 msec, and most of the time is actually spent in commit) and performance is very bad - pgbench shows about 300 TPS for the optimal number of clients (about 10; for larger numbers performance is almost the same). Without logical replication the same setup gives about 6000 TPS. I have checked the syncrepl.c file, particularly the SyncRepGetSyncRecPtr function. Each walsender independently calculates the minimal LSN among all synchronous replicas and wakes up the backends waiting for this LSN. It means that a transaction updating data in one shard will actually wait for confirmation from the replication channels of all shards. If some shard is updated more rarely than others or is not updated at all (for example because the communication channel to its node is broken), then all backends will get stuck. Also, all backends compete for the single SyncRepLock, which can also be a contention point. -- Konstantin Knizhnik Postgres Professional: http://www.postgrespro.com The Russian Postgres Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Columnar storage support
Unfortunately C-Store doesn't let you take full advantage of a columnar store: you are still not able to perform vector operations. C-Store reduces the amount of data read from disk because of 1. fetching only the columns which are used in the query, and 2. data compression. This leads to some benefit in query execution speed for cold data (when it is not loaded in the cache). For warm data there is almost no difference (except for very huge tables which cannot fit in memory). But the main advantage of a vertical data format - vector data processing - is possible only with a specialized executor. There is a prototype of a vector executor for C-Store: https://github.com/citusdata/postgres_vectorization_test It provides a 3-4x speedup on some queries, but it is really a prototype and research project, far from practical usage. I have also developed two columnar store extensions for Postgres: IMCS (In-Memory-Columnar-Store): https://github.com/knizhnik/imcs.git VOPS (Vectorized Operations): https://github.com/postgrespro/vops.git The first one is more oriented towards in-memory databases (although it supports spilling data to disk) and requires using special functions to manipulate the columnar data. In this case the columnar store is a copy of the main (horizontal) store (normal Postgres tables). VOPS is more recent work which allows using more or less normal SQL (via a foreign data wrapper and user-defined types/operators). In VOPS data is stored inside normal Postgres tables, but using vectors (tiles) instead of scalars. Both IMCS and VOPS provide a 10-100 times speed improvement on queries like Q1 in TPC-H (sequential scan with filtering and aggregation). For queries involving joins there is almost no benefit compared with normal Postgres. There is also a columnar storage extension developed by Fujitsu: https://www.postgresql.org/message-id/cajrrpgfac7wc9nk6ptty6yn-nn+hcy8xolah2doyhvg5d6h...@mail.gmail.com But the published patch is only a first step in this direction, and it is possible neither to use it in practice nor to perform experiments measuring the possible performance improvement. On 09.10.2017 23:06, Joshua D. Drake wrote: On 10/09/2017 01:03 PM, legrand legrand wrote: Is there a chance that pluggable storage permits creation of a columnar RDBMS like MonetDB in PostgreSQL? Thanks in advance for the answer The extension C-Store from Citus is probably what you are looking for. jD -- Sent from: http://www.postgresql-archive.org/PostgreSQL-hackers-f1928748.html -- Konstantin Knizhnik Postgres Professional: http://www.postgrespro.com The Russian Postgres Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Slow synchronous logical replication
On 11.10.2017 10:07, Craig Ringer wrote: On 9 October 2017 at 15:37, Konstantin Knizhnik wrote: Thank you for the explanations. On 08.10.2017 16:00, Craig Ringer wrote: I think it'd be helpful if you provided reproduction instructions, test programs, etc, making it very clear when things are / aren't related to your changes. It will not be so easy to provide a reproducing scenario, because it involves many components (postgres_fdw, pg_pathman, pg_shardman, LR, ...) So simplify it to a test case that doesn't. The simplest reproducing scenario is the following (a condensed SQL version of these steps is sketched below):
1. Start two Postgres instances: synchronous_commit=on, fsync=off.
2. Initialize the pgbench database on both instances: pgbench -i
3. Create a publication for the pgbench_accounts table on one node.
4. Create the corresponding subscription on the other node with the copy_data=false parameter.
5. Add the subscription to synchronous_standby_names on the first node.
6. Start pgbench -c 8 -N -T 100 -P 1 on the first node.
On my systems the results are the following:
standalone postgres: 8600 TPS
asynchronous replication: 6600 TPS
synchronous replication: 5600 TPS
Quite good results.
7. Create some dummy table and perform a bulk insert into it: create table dummy(x integer primary key); insert into dummy values (generate_series(1,1000));
pgbench almost stalls: until the end of the insert, performance drops almost to zero. The reason for this behavior is obvious: the walsender has to decode the huge transaction generated by the insert although it has no relation to this publication. Filtering of the insert records of this transaction is done only inside the output plugin. Unfortunately it is not quite clear how to make the walsender smarter and let it skip transactions not affecting its publication. One possible solution is to let the backend inform the walsender about the smallest LSN it should wait for (the backend knows which table is affected by the current operation, and hence which publications are interested in it, and so can point the walsender to the proper LSN without decoding a huge part of WAL). But it seems to be not so easy to implement. -- Konstantin Knizhnik Postgres Professional: http://www.postgrespro.com The Russian Postgres Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
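A condensed, hedged version of the steps above, with host names illustrative, the pgbench invocations left as comments, and the row count in the final insert likewise only an example:

-- pgbench -i   (run against both instances first)

-- On the publishing instance:
CREATE PUBLICATION pgbench_pub FOR TABLE pgbench_accounts;
ALTER SYSTEM SET synchronous_standby_names = 'pgbench_sub';
SELECT pg_reload_conf();

-- On the subscribing instance:
CREATE SUBSCRIPTION pgbench_sub
    CONNECTION 'host=publisher dbname=postgres'
    PUBLICATION pgbench_pub
    WITH (copy_data = false);

-- pgbench -c 8 -N -T 100 -P 1   (against the publisher)

-- While pgbench is running, on the publisher:
CREATE TABLE dummy(x integer PRIMARY KEY);
INSERT INTO dummy VALUES (generate_series(1, 10000000));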
Re: [HACKERS] Slow synchronous logical replication
On 12.10.2017 04:23, Craig Ringer wrote: On 12 October 2017 at 00:57, Konstantin Knizhnik wrote: The reason for this behavior is obvious: the walsender has to decode the huge transaction generated by the insert although it has no relation to this publication. It does. Though I wouldn't expect anywhere near the kind of drop you report, and haven't observed it here. Is the CREATE TABLE and INSERT done in the same transaction? No. The table was created in a separate transaction. Moreover, the same effect takes place if the table is created before the start of replication. The problem in this case seems to be caused by spilling the decoded transaction to a file in ReorderBufferSerializeTXN. Please look at two profiles: http://garret.ru/lr1.svg corresponds to normal operation of pgbench with synchronous replication to one replica, http://garret.ru/lr2.svg - the same with concurrent execution of the huge insert statement. And here is the output of pgbench (at the fifth second the insert is started):
progress: 1.0 s, 10020.9 tps, lat 0.791 ms stddev 0.232
progress: 2.0 s, 10184.1 tps, lat 0.786 ms stddev 0.192
progress: 3.0 s, 10058.8 tps, lat 0.795 ms stddev 0.301
progress: 4.0 s, 10230.3 tps, lat 0.782 ms stddev 0.194
progress: 5.0 s, 10335.0 tps, lat 0.774 ms stddev 0.192
progress: 6.0 s, 4535.7 tps, lat 1.591 ms stddev 9.370
progress: 7.0 s, 419.6 tps, lat 20.897 ms stddev 55.338
progress: 8.0 s, 105.1 tps, lat 56.140 ms stddev 76.309
progress: 9.0 s, 9.0 tps, lat 504.104 ms stddev 52.964
progress: 10.0 s, 14.0 tps, lat 797.535 ms stddev 156.082
progress: 11.0 s, 14.0 tps, lat 601.865 ms stddev 93.598
progress: 12.0 s, 11.0 tps, lat 658.276 ms stddev 138.503
progress: 13.0 s, 9.0 tps, lat 784.120 ms stddev 127.206
progress: 14.0 s, 7.0 tps, lat 870.944 ms stddev 156.377
progress: 15.0 s, 8.0 tps, lat .578 ms stddev 140.987
progress: 16.0 s, 7.0 tps, lat 1258.750 ms stddev 75.677
progress: 17.0 s, 6.0 tps, lat 991.023 ms stddev 229.058
progress: 18.0 s, 5.0 tps, lat 1063.986 ms stddev 269.361
It seems to be an effect of large transactions. The presence of several channels of synchronous logical replication reduces performance, but not so much. Below are results on another machine with pgbench scale 10.
Configuration              TPS
standalone                 15k
1 async logical replica    13k
1 sync logical replica     10k
3 async logical replicas   13k
3 sync logical replicas     8k
Only partly true. The output plugin can register a transaction origin filter and use that to say it's entirely uninterested in a transaction. But this only works based on filtering by origins. Not tables. Yes, I know about the origin filtering mechanism (and we are using it in multimaster). But I am speaking about the standard pgoutput.c output plugin: its pgoutput_origin_filter always returns false. I imagine we could call another hook in output plugins, "do you care about this table", and use it to skip some more work for tuples that particular decoding session isn't interested in. Skip adding them to the reorder buffer, etc. No such hook currently exists, but it'd be an interesting patch for Pg11 if you feel like working on it. Unfortunately it is not quite clear how to make the walsender smarter and let it skip transactions not affecting its publication. As noted, it already can do so by origin. Mostly. We cannot totally skip over WAL, since we need to process various invalidations etc. See ReorderBufferSkip. The problem is that before the end of the transaction we do not know whether it touches this publication or not. So filtering by origin will not work in this case. I am really not sure that it is possible to skip over WAL.
But the particular problem with invalidation records etc. can be solved by always processing these records in the walsender. I.e. if a backend inserts an invalidation record or some other record which should always be processed by the walsender, it can always promote the LSN of this record to the walsender. So the walsender will skip only those WAL records which are safe to skip (insert/update/delete records not affecting this publication). I wonder if there can be some other problems with skipping part of a transaction in the walsender. -- Konstantin Knizhnik Postgres Professional: http://www.postgrespro.com The Russian Postgres Company
[HACKERS] Deadlock in ALTER SUBSCRIPTION REFRESH PUBLICATION
ty_hook (first_arg=0x24dd998, queryString=0x24dc920 "SET SESSION synchronous_commit TO local; ALTER SUBSCRIPTION sub_3_1 REFRESH PUBLICATION", context=PROCESS_UTILITY_QUERY, params=0x0, queryEnv=0x0, dest=0x24ddcb8, completionTag=0x73a43c90 "") at src/hooks.c:913 #14 0x008a80a2 in ProcessUtility (pstmt=0x24dd998, queryString=0x24dc920 "SET SESSION synchronous_commit TO local; ALTER SUBSCRIPTION sub_3_1 REFRESH PUBLICATION", context=PROCESS_UTILITY_QUERY, params=0x0, queryEnv=0x0, dest=0x24ddcb8, completionTag=0x73a43c90 "") at utility.c:353 #15 0x008a7075 in PortalRunUtility (portal=0x246be60, pstmt=0x24dd998, isTopLevel=0 '\000', setHoldSnapshot=0 '\000', dest=0x24ddcb8, completionTag=0x73a43c90 "") at pquery.c:1178 #16 0x008a728d in PortalRunMulti (portal=0x246be60, isTopLevel=0 '\000', setHoldSnapshot=0 '\000', dest=0x24ddcb8, altdest=0x24ddcb8, completionTag=0x73a43c90 "") at pquery.c:1324 #17 0x008a6757 in PortalRun (portal=0x246be60, count=9223372036854775807, isTopLevel=0 '\000', run_once=1 '\001', dest=0x24ddcb8, altdest=0x24ddcb8, completionTag=0x73a43c90 "") at pquery.c:799 #18 0x008a0288 in exec_simple_query (query_string=0x24dc920 "SET SESSION synchronous_commit TO local; ALTER SUBSCRIPTION sub_3_1 REFRESH PUBLICATION") at postgres.c:1099 #19 0x008a4823 in PostgresMain (argc=1, argv=0x247c2d0, dbname=0x247c2a8 "postgres", username=0x244f870 "knizhnik") at postgres.c:4090 #20 0x00801753 in BackendRun (port=0x240) at postmaster.c:4357 #21 0x00800e5f in BackendStartup (port=0x240) at postmaster.c:4029 #22 0x007fd398 in ServerLoop () at postmaster.c:1753 #23 0x007fc92f in PostmasterMain (argc=3, argv=0x244d6e0) at postmaster.c:1361 #24 0x00734f08 in main (argc=3, argv=0x244d6e0) at main.c:228 The reason of this deadlock seems to be clear: ALTER SUBSCRIPTION starts transaction at one node and tries to create slot at other node, which waiting for completion of all active transaction while building scnapshpot. Is there any way to avoid this deadlock? The same deadlock can happen in concurrent CREATE SUBSCRIPTION with implicit slot creation. But here it can be sovled by explicit slot creation prior to CREATE SUBSCRIPTION. But what can I do if I want to update publications and refresh subscriptions? Thanks in advance, -- Konstantin Knizhnik Postgres Professional: http://www.postgrespro.com The Russian Postgres Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
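A sketch of the explicit-slot-creation workaround mentioned above for the CREATE SUBSCRIPTION case, with node and publication names purely illustrative; it does not help with ALTER SUBSCRIPTION ... REFRESH PUBLICATION, which is exactly the open question here:

-- On the publishing node, outside of any surrounding transaction:
SELECT pg_create_logical_replication_slot('sub_3_1', 'pgoutput');

-- On the subscribing node, reuse the pre-created slot instead of letting
-- CREATE SUBSCRIPTION create one inside its own transaction:
CREATE SUBSCRIPTION sub_3_1
    CONNECTION 'host=node3 dbname=postgres'
    PUBLICATION pub_3
    WITH (create_slot = false, slot_name = 'sub_3_1', copy_data = false);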
Re: [HACKERS] Secondary index access optimizations
On 11/06/2017 04:27 AM, Thomas Munro wrote: On Fri, Sep 8, 2017 at 3:58 AM, Konstantin Knizhnik wrote: Updated version of the patch is attached to this mail. Also I added support of date type to operator_predicate_proof to be able to imply (logdate <= '2017-03-31') from (logdate < '2017-04-01') . Hi Konstantin, Is there any reason why you don't want to split this into two separate proposals? One for remove_restrictions_implied_by_constraints() and one for the operator_predicate_proof() changes. Your v3 patch breaks the new partition_join test (the recently committed partition-wise join stuff), as far as I can tell in a good way. Can you please double check those changes and post an updated patch? Hi Thomas. The primary idea of this patch was to provide more efficient plans for queries on partitioned tables. So remove_restrictions_implied_by_constraints() removes redundant predicate checks. But it doesn't work for standard Postgres 10 partitioning, because here constraints are set using intervals with open high boundary and original version of operator_predicate_proof() is not able to handle this case. I have explained this problem in my previous e-mails in this thread. This is why I have changed operator_predicate_proof() to correctly handle this case. If you think this patch should be splitted into two, ok: I can do it. I just want to notice that without patching operator_predicate_proof() it may give not positive effect for standard partitioning, which I expect to be the most popular use case where this optimization may have an effect. Concerning broken partition_join test: it is "expected" failure: my patch removes from the plans redundant checks. So the only required action is to update expected file with results. Attached please find updated patch. 
-- Konstantin Knizhnik Postgres Professional: http://www.postgrespro.com The Russian Postgres Company diff --git a/contrib/postgres_fdw/expected/postgres_fdw.out b/contrib/postgres_fdw/expected/postgres_fdw.out index 4339bbf..0931af1 100644 --- a/contrib/postgres_fdw/expected/postgres_fdw.out +++ b/contrib/postgres_fdw/expected/postgres_fdw.out @@ -626,12 +626,12 @@ EXPLAIN (VERBOSE, COSTS OFF) SELECT * FROM ft1 t1 WHERE c1 IS NULL;-- Nu Remote SQL: SELECT "C 1", c2, c3, c4, c5, c6, c7, c8 FROM "S 1"."T 1" WHERE (("C 1" IS NULL)) (3 rows) -EXPLAIN (VERBOSE, COSTS OFF) SELECT * FROM ft1 t1 WHERE c1 IS NOT NULL;-- NullTest - QUERY PLAN -- +EXPLAIN (VERBOSE, COSTS OFF) SELECT * FROM ft1 t1 WHERE c1 IS NOT NULL and c3 is not null;-- NullTest +QUERY PLAN +-- Foreign Scan on public.ft1 t1 Output: c1, c2, c3, c4, c5, c6, c7, c8 - Remote SQL: SELECT "C 1", c2, c3, c4, c5, c6, c7, c8 FROM "S 1"."T 1" WHERE (("C 1" IS NOT NULL)) + Remote SQL: SELECT "C 1", c2, c3, c4, c5, c6, c7, c8 FROM "S 1"."T 1" WHERE ((c3 IS NOT NULL)) (3 rows) EXPLAIN (VERBOSE, COSTS OFF) SELECT * FROM ft1 t1 WHERE round(abs(c1), 0) = 1; -- FuncExpr diff --git a/contrib/postgres_fdw/sql/postgres_fdw.sql b/contrib/postgres_fdw/sql/postgres_fdw.sql index ddfec79..878bfc7 100644 --- a/contrib/postgres_fdw/sql/postgres_fdw.sql +++ b/contrib/postgres_fdw/sql/postgres_fdw.sql @@ -292,7 +292,7 @@ RESET enable_nestloop; EXPLAIN (VERBOSE, COSTS OFF) SELECT * FROM ft1 t1 WHERE t1.c1 = 1; -- Var, OpExpr(b), Const EXPLAIN (VERBOSE, COSTS OFF) SELECT * FROM ft1 t1 WHERE t1.c1 = 100 AND t1.c2 = 0; -- BoolExpr EXPLAIN (VERBOSE, COSTS OFF) SELECT * FROM ft1 t1 WHERE c1 IS NULL;-- NullTest -EXPLAIN (VERBOSE, COSTS OFF) SELECT * FROM ft1 t1 WHERE c1 IS NOT NULL;-- NullTest +EXPLAIN (VERBOSE, COSTS OFF) SELECT * FROM ft1 t1 WHERE c1 IS NOT NULL and c3 is not null;-- NullTest EXPLAIN (VERBOSE, COSTS OFF) SELECT * FROM ft1 t1 WHERE round(abs(c1), 0) = 1; -- FuncExpr EXPLAIN (VERBOSE, COSTS OFF) SELECT * FROM ft1 t1 WHERE c1 = -c1; -- OpExpr(l) EXPLAIN (VERBOSE, COSTS OFF) SELECT * FROM ft1 t1 WHERE 1 = c1!; -- OpExpr(r) diff --git a/src/backend/optimizer/path/allpaths.c b/src/backend/optimizer/path/allpaths.c index a5c1b68..082d1cc 100644 --- a/src/backend/optimizer/path/allpaths.c +++ b/src/backend/optimizer/path/allpaths.c @@ -346,6 +346,7 @@ set_rel_size(PlannerInfo *root, RelOptInfo *rel, switch (rel->rtekind) { case RTE_RELATION: +remove_restrictions_implied_by_constraints(root, rel, rte); if (rte->relkind == RELKIND_FOREIGN_TABLE)
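To see the kind of query the patch is aimed at, here is an illustrative example (not taken from the patch's regression tests) with standard PostgreSQL 10 range partitioning, where the partition constraint uses a half-open interval and the benefit depends on recognizing that logdate < '2017-04-01' is equivalent to logdate <= '2017-03-31' for a date column:

CREATE TABLE log (id int, logdate date) PARTITION BY RANGE (logdate);
CREATE TABLE log_2017_03 PARTITION OF log
    FOR VALUES FROM ('2017-03-01') TO ('2017-04-01');

-- With remove_restrictions_implied_by_constraints(), the scan of
-- log_2017_03 no longer needs to re-check these predicates, because the
-- partition constraint already implies them:
EXPLAIN SELECT * FROM log
 WHERE logdate >= '2017-03-01' AND logdate < '2017-04-01';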
Re: [HACKERS] SQL procedures
On 08.11.2017 17:23, Merlin Moncure wrote: On Tue, Oct 31, 2017 at 12:23 PM, Peter Eisentraut wrote: - Transaction control in procedure bodies This feature is really key, since it enables via SQL lots of things that are not possible without external coding, including:
*) very long running processes in a single routine
*) transaction isolation control inside the procedure (currently the client app has to declare this)
*) certain error handling cases that require client side support
*) simple in-database threading
*) simple construction of daemon scripts (yeah, you can use bgworker for this, but a pure sql daemon with a cron heartbeat hook is hard to beat for simplicity)
I do wonder how transaction control could be added later. The last time I (lightly) looked at this, I was starting to think that working transaction control into the SPI interface was the wrong approach; pl/pgsql would have to adopt a very different set of behaviors if it was called in a function or a proc. If you restricted language choice to purely SQL, you could work around this problem; SPI languages would be totally abstracted from those sets of considerations and you could always call an arbitrary language function if you needed to. SQL has no flow control but I'm not too concerned about that. merlin I am also very interested in the answer to this question: how are you going to implement transaction control inside a procedure? Right now PostgresPro EE supports autonomous transactions. They are supported both for SQL and for the plpgsql/plpython APIs. They are implemented by saving/restoring the transaction context, so unlike most other ATX implementations, in pgpro an autonomous transaction is executed by the same backend. But it is not so easy to do: in Postgres almost every module has its own static variables which keep transaction-specific data. So we have to provide a dozen suspend/resume functions: SuspendSnapshot(), SuspendPredicate(), SuspendStorage(), SuspendInvalidationInfo(), SuspendPgXact(), PgStatSuspend(), TriggerSuspend(), SuspendSPI()... and properly handle local cache invalidation. The patch consists of more than 5 thousand lines. So my question is whether you are going to implement something similar or use a completely different approach? In the first case it would be good to somehow unite our efforts... For example, we can publish our ATX patch for Postgres 10. We have not done it yet, because there seems to be no chance of pushing this patch to the community. -- Konstantin Knizhnik Postgres Professional: http://www.postgrespro.com The Russian Postgres Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
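For readers unfamiliar with the proposal under discussion, the intended usage looks roughly like the following sketch (syntax along the lines of the SQL-procedures patch series; the table and column names are invented, and in-procedure COMMIT was not available in any released version at the time of this thread):

CREATE PROCEDURE batch_cleanup() LANGUAGE plpgsql AS $$
BEGIN
    LOOP
        -- delete in limited batches so each batch commits separately
        DELETE FROM events
         WHERE id IN (SELECT id FROM events WHERE done LIMIT 10000);
        EXIT WHEN NOT FOUND;
        COMMIT;  -- ends the current transaction inside the procedure;
                 -- a new one is started automatically for the next batch
    END LOOP;
END;
$$;

CALL batch_cleanup();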
[HACKERS] Aggregates push-down to partitions
There is a huge thread concerning pushing down aggregates to FDWs: https://www.postgresql.org/message-id/flat/CAFjFpRcnueviDpngJ3QSVvj7oyukr9NkSiCspqd4N%2BdCEdvYvg%40mail.gmail.com#cafjfprcnuevidpngj3qsvvj7oyukr9nksicspqd4n+dcedv...@mail.gmail.com but as far as I understand, nothing has been done yet for efficient calculation of aggregates over a partitioned table. In the case of local partitions this is partly compensated by a parallel query plan:

postgres=# create table base(x integer);
CREATE TABLE
postgres=# create table derived1() inherits (base);
CREATE TABLE
postgres=# create table derived2() inherits (base);
CREATE TABLE
postgres=# insert into derived1 values (generate_series(1,1000000));
INSERT 0 1000000
postgres=# insert into derived2 values (generate_series(1,1000000));
INSERT 0 1000000
postgres=# explain select sum(x) from base;
                                     QUERY PLAN
-------------------------------------------------------------------------------------
 Finalize Aggregate  (cost=12176.63..12176.64 rows=1 width=8)
   ->  Gather  (cost=12176.59..12176.61 rows=8 width=8)
         Workers Planned: 8
         ->  Partial Aggregate  (cost=12175.59..12175.60 rows=1 width=8)
               ->  Append  (cost=0.00..11510.47 rows=266048 width=4)
                     ->  Parallel Seq Scan on base  (cost=0.00..0.00 rows=1 width=4)
                     ->  Parallel Seq Scan on derived1  (cost=0.00..5675.00 rows=125000 width=4)
                     ->  Parallel Seq Scan on derived2  (cost=0.00..5835.47 rows=141047 width=4)
(8 rows)

This is still far from the ideal plan, because each worker works with all partitions instead of splitting the partitions between the workers and calculating a partial aggregate per partition. But if we add an FDW child to the parent table, then a parallel scan cannot be used at all and we get the worst possible plan:

postgres=# create foreign table derived_fdw() inherits(base) server pg_fdw options (table_name 'derived1');
CREATE FOREIGN TABLE
postgres=# explain select sum(x) from base;
                                  QUERY PLAN
------------------------------------------------------------------------------
 Aggregate  (cost=34055.07..34055.08 rows=1 width=8)
   ->  Append  (cost=0.00..29047.75 rows=2002926 width=4)
         ->  Seq Scan on base  (cost=0.00..0.00 rows=1 width=4)
         ->  Seq Scan on derived1  (cost=0.00..14425.00 rows=1000000 width=4)
         ->  Seq Scan on derived2  (cost=0.00..14425.00 rows=1000000 width=4)
         ->  Foreign Scan on derived_fdw  (cost=100.00..197.75 rows=2925 width=4)
(6 rows)

So we sequentially pull all the data to this node and compute the aggregate locally. The ideal plan would calculate partial aggregates at all nodes in parallel and then combine the partial results. That requires two changes:

1. Replace Aggregate->Append with Finalize_Aggregate->Append->Partial_Aggregate.
2. Concurrent execution of Append.

The second item can be done in two different ways: we can try to use the existing parallel-worker infrastructure and replace Append with Gather, which seems to be the best approach for local partitioning. In the case of remote (FDW) partitions it is enough to separate starting the execution (PQsendQuery in postgres_fdw) from getting the results, which requires some changes in the FDW protocol.

I wonder whether somebody has already investigated this problem or is working in this direction. Maybe there are already some patches proposed? I have searched the hackers archive but didn't find anything relevant... Are there any suggestions about the best approach to implement this feature?

-- Konstantin Knizhnik Postgres Professional: http://www.postgrespro.com The Russian Postgres Company
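For illustration, with a postgres_fdw version that supports aggregate push-down the desired plan shape can be forced by hand today only by aggregating each child separately and combining the per-child results locally; a sketch using the tables from the example above (ONLY keeps the parent from re-scanning its children):

SELECT sum(s) AS total
FROM (
    SELECT sum(x) AS s FROM ONLY base
    UNION ALL
    SELECT sum(x) AS s FROM derived1
    UNION ALL
    SELECT sum(x) AS s FROM derived2
    UNION ALL
    SELECT sum(x) AS s FROM derived_fdw
) AS partial_sums;

This is essentially what a Finalize_Aggregate->Append->Partial_Aggregate plan would produce automatically.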
Re: [HACKERS] Aggregates push-down to partitions
On 10.11.2017 12:15, Ashutosh Bapat wrote:
Maybe in this thread[1] your described problem is solved through introducing a Parallel Append node? 1. https://www.postgresql.org/message-id/CAJ3gD9dy0K_E8r727heqXoBmWZ83HwLFwdcaSSmBQ1%2BS%2BvRuUQ%40mail.gmail.com
You may want to review [2] and [3] as well.
[2] https://www.postgresql.org/message-id/9666.1491295317@localhost
[3] https://www.postgresql.org/message-id/CAM2+6=V64_xhstVHie0Rz=kpeqnljmzt_e314p0jat_oj9m...@mail.gmail.com

Thank you very much for these references. I applied the partition-wise-agg-v6 patches, and for partitioned tables it works perfectly:

shard=# explain select count(*) from orders;
                                  QUERY PLAN
------------------------------------------------------------------------------
 Finalize Aggregate  (cost=100415.29..100415.30 rows=1 width=8)
   ->  Append  (cost=50207.63..100415.29 rows=2 width=8)
         ->  Partial Aggregate  (cost=50207.63..50207.64 rows=1 width=8)
               ->  Foreign Scan on orders_0  (cost=101.00..50195.13 rows=5000 width=0)
         ->  Partial Aggregate  (cost=50207.63..50207.64 rows=1 width=8)
               ->  Foreign Scan on orders_1  (cost=101.00..50195.13 rows=5000 width=0)
(6 rows)

But I wonder why the same optimization is not applied to a normal inherited table:

shard=# explain select count(*) from base;
                                  QUERY PLAN
------------------------------------------------------------------------------
 Aggregate  (cost=44087.99..44088.00 rows=1 width=8)
   ->  Append  (cost=0.00..39079.46 rows=2003414 width=0)
         ->  Seq Scan on base  (cost=0.00..0.00 rows=1 width=0)
         ->  Seq Scan on derived1  (cost=0.00..14425.00 rows=1000000 width=0)
         ->  Seq Scan on derived2  (cost=0.00..14425.00 rows=1000000 width=0)
         ->  Foreign Scan on derived_fdw  (cost=100.00..212.39 rows=3413 width=0)
(6 rows)

Are there some fundamental problems with that?

-- Konstantin Knizhnik Postgres Professional: http://www.postgrespro.com The Russian Postgres Company
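For comparison, the declaratively partitioned form of the same layout is shown below (a sketch with made-up names and bounds). Presumably the partition-wise aggregation patch keys on the declarative partitioning metadata, which a plain inheritance hierarchy does not carry, and that would explain the difference; this is a guess, not a statement about the patch internals.

CREATE TABLE base_part (x integer) PARTITION BY RANGE (x);
CREATE TABLE base_part_1 PARTITION OF base_part FOR VALUES FROM (1) TO (1000001);
CREATE TABLE base_part_2 PARTITION OF base_part FOR VALUES FROM (1000001) TO (2000001);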
Re: [HACKERS] Partition-wise aggregation/grouping
On 10/27/2017 02:01 PM, Jeevan Chalke wrote:
Hi, Attached new patch-set here. Changes include:
1. Added separate patch for costing Append node as discussed up-front in the patch-set.
2. Since we now cost Append node, we don't need partition_wise_agg_cost_factor GUC. So removed that. The remaining patch hence merged into main implementation patch.
3. Updated rows in test-cases so that we will get partition-wise plans.
Thanks

I applied the partition-wise-agg-v6.tar.gz patch to master and used the shard.sh example from https://www.postgresql.org/message-id/14577.1509723225%40localhost The plan for count(*) is the following:

shard=# explain select count(*) from orders;
                                  QUERY PLAN
------------------------------------------------------------------------------
 Finalize Aggregate  (cost=100415.29..100415.30 rows=1 width=8)
   ->  Append  (cost=50207.63..100415.29 rows=2 width=8)
         ->  Partial Aggregate  (cost=50207.63..50207.64 rows=1 width=8)
               ->  Foreign Scan on orders_0  (cost=101.00..50195.13 rows=5000 width=0)
         ->  Partial Aggregate  (cost=50207.63..50207.64 rows=1 width=8)
               ->  Foreign Scan on orders_1  (cost=101.00..50195.13 rows=5000 width=0)

We really calculate a partial aggregate for each partition, but to do so we still have to fetch all the data from the remote host. So for foreign partitions such plans are absolutely inefficient. Maybe it should be combined with some other patch? For example, with the agg_pushdown_v4.tgz patch from https://www.postgresql.org/message-id/14577.1509723225%40localhost ? But it does not apply on top of the partition-wise-agg-v6.tar.gz patch.

Also, postgres_fdw in 11dev is able to push down aggregates even without the agg_pushdown_v4.tgz patch. In 0009-Teach-postgres_fdw-to-push-aggregates-for-child-rela.patch there is the following check:

/* Partial aggregates are not supported. */
+ if (extra->isPartial)
+     return;

If we just comment out this check, the produced plan becomes:

shard=# explain select sum(product_id) from orders;
                          QUERY PLAN
---------------------------------------------------------------
 Finalize Aggregate  (cost=308.41..308.42 rows=1 width=8)
   ->  Append  (cost=144.18..308.41 rows=2 width=8)
         ->  Foreign Scan  (cost=144.18..154.20 rows=1 width=8)
               Relations: Aggregate on (public.orders_0 orders)
         ->  Foreign Scan  (cost=144.18..154.20 rows=1 width=8)
               Relations: Aggregate on (public.orders_1 orders)
(6 rows)

And this is actually the desired plan!

Obviously such an approach will not always work: the FDW really doesn't support partial aggregates now. But for the most frequently used aggregates (sum, min, max, count) aggtype == aggtranstype, and there is no difference between partial and normal aggregate calculation. So instead of the (extra->isPartial) condition we could add a more complex check which traverses the pathtarget expressions and checks whether they can be evaluated in this way. Or... extend the FDW API to support partial aggregation.

But even the last plan is not ideal: it will run the per-node aggregation sequentially. There is the parallel append patch: https://www.postgresql.org/message-id/CAJ3gD9ctEcrVUmpY6fq_JUB6WDKGXAGd70EY68jVFA4kxMbKeQ%40mail.gmail.com but... the FDW doesn't support parallel scan, so parallel append cannot be applied in this case. And we actually do not need parallel append with all its dynamic workers here: we just need to start the commands at all remote servers and only afterwards fetch the results (which can be done sequentially).

I am investigating the problem of efficient execution of OLAP queries on sharded tables (tables with remote partitions). After reading all these threads and the corresponding patches, it seems to me that we already have most of the pieces of the puzzle; what we need is to put them in the right places and maybe add the missing ones. I wonder if somebody is busy with this, and can I somehow help here?

Also, I am not quite sure about the best approach to parallel execution of a distributed query at all nodes. Should we make postgres_fdw parallel-safe and use parallel append? How difficult would that be? Or, in addition to parallel append, should we also have an "asynchronous append" which would be able to initiate execution at all nodes? It seems to be close to merge append, because it should traverse all cursors simultaneously. The second approach looks easier to implement, but in the case of a sharded table a distributed query may need to traverse both remote and local shards, and this approach doesn't allow several local shards to be processed in parallel.

-- Konstantin Knizhnik Postgres Professional: http://www.postgrespro.com The Russian Postgres Company
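To make the "start commands at all remote servers and only then fetch results" idea concrete, here is a hand-written emulation of the desired execution pattern using dblink; connection strings and remote table names are illustrative, and this only sketches what an asynchronous append could do internally:

SELECT dblink_connect('shard1', 'dbname=postgres host=node1');
SELECT dblink_connect('shard2', 'dbname=postgres host=node2');

-- start the per-shard aggregates without waiting for their results
SELECT dblink_send_query('shard1', 'SELECT sum(product_id) FROM orders_0');
SELECT dblink_send_query('shard2', 'SELECT sum(product_id) FROM orders_1');

-- both shards are now working concurrently; fetching and combining
-- the partial results can be done sequentially
SELECT sum(s) AS total
FROM (
    SELECT * FROM dblink_get_result('shard1') AS r1(s bigint)
    UNION ALL
    SELECT * FROM dblink_get_result('shard2') AS r2(s bigint)
) partials;

SELECT dblink_disconnect('shard1');
SELECT dblink_disconnect('shard2');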
Re: [HACKERS] Partition-wise aggregation/grouping
On 11.11.2017 23:29, Konstantin Knizhnik wrote:
Also, I am not quite sure about the best approach to parallel execution of a distributed query at all nodes. Should we make postgres_fdw parallel-safe and use parallel append? How difficult would that be? Or, in addition to parallel append, should we also have an "asynchronous append" which would be able to initiate execution at all nodes?

I attach a small patch for postgres_fdw.c which allows concurrent execution of aggregates by all remote servers (when they are accessed through postgres_fdw). I have added "postgres_fdw.use_
[HACKERS] On conflict update & hint bits
Hi, I am facing a rarely reproducible problem in our multimaster (and never in vanilla Postgres). We are using our own customized transaction manager, so it may well be a problem in our multimaster. But the stack trace looks suspicious, and this is why I want to consult people familiar with this code about whether it is a bug in ExecOnConflictUpdate or not.

Briefly: ExecOnConflictUpdate tries to set a hint bit without holding a lock on the buffer and so gets an assertion failure in MarkBufferDirtyHint. Now the stack trace:

#0 0x7fe3b940acc9 in __GI_raise (sig=sig@entry=6) at ../nptl/sysdeps/unix/sysv/linux/raise.c:56
#1 0x7fe3b940e0d8 in __GI_abort () at abort.c:89
#2 0x0097b996 in ExceptionalCondition (conditionName=0xb4d970 "!(LWLockHeldByMe(((LWLock*) (&(bufHdr)->content_lock", errorType=0xb4d2e9 "FailedAssertion", fileName=0xb4d2e0 "bufmgr.c", lineNumber=3380) at assert.c:54
#3 0x007e365b in MarkBufferDirtyHint (buffer=946, buffer_std=1 '\001') at bufmgr.c:3380
#4 0x009c3660 in SetHintBits (tuple=0x7fe396a9d858, buffer=946, infomask=256, xid=1398) at tqual.c:136
#5 0x009c5194 in HeapTupleSatisfiesMVCC (htup=0x7ffc00169030, snapshot=0x2e79778, buffer=946) at tqual.c:1065
#6 0x006ace83 in ExecCheckHeapTupleVisible (estate=0x2e81ae8, tuple=0x7ffc00169030, buffer=946) at nodeModifyTable.c:197
#7 0x006ae343 in ExecOnConflictUpdate (mtstate=0x2e81d50, resultRelInfo=0x2e81c38, conflictTid=0x7ffc001690c0, planSlot=0x2e82428, excludedSlot=0x2e82428, estate=0x2e81ae8, canSetTag=1 '\001', returning=0x7ffc001690c8) at nodeModifyTable.c:1173
#8 0x006ad256 in ExecInsert (mtstate=0x2e81d50, slot=0x2e82428, planSlot=0x2e82428, arbiterIndexes=0x2e7eeb0, onconflict=ONCONFLICT_UPDATE, estate=0x2e81ae8, canSetTag=1 '\001') at nodeModifyTable.c:395
#9 0x006aebe3 in ExecModifyTable (node=0x2e81d50) at nodeModifyTable.c:1496

In ExecOnConflictUpdate the buffer is pinned but not locked:

/*
 * Lock tuple for update. Don't follow updates when tuple cannot be
 * locked without doing so. A row locking conflict here means our
 * previous conclusion that the tuple is conclusively committed is not
 * true anymore.
 */
tuple.t_self = *conflictTid;
test = heap_lock_tuple(relation, &tuple, estate->es_output_cid,
                       lockmode, LockWaitBlock, false, &buffer,
                       &hufd);

heap_lock_tuple pins the buffer but does not lock it:

 *	*buffer: set to buffer holding tuple (pinned but not locked at exit)

Later we try to check tuple visibility:

ExecCheckHeapTupleVisible(estate, &tuple, buffer);

and inside HeapTupleSatisfiesMVCC we try to set a hint bit. MarkBufferDirtyHint assumes that the buffer is locked:

 * 2. The caller might have only share-lock instead of exclusive-lock on the
 *    buffer's content lock.

and we get an assertion failure at:

/* here, either share or exclusive lock is OK */
Assert(LWLockHeldByMe(BufferDescriptorGetContentLock(bufHdr)));

So the question is whether it is correct that ExecOnConflictUpdate tries to access and update the tuple without holding a lock on the buffer?

Thanks in advance,

-- Konstantin Knizhnik Postgres Professional: http://www.postgrespro.com The Russian Postgres Company
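For context, a minimal sketch of the statement shape that reaches ExecOnConflictUpdate and then ExecCheckHeapTupleVisible is shown below; table and column names are made up, and on a vanilla assert-enabled build this by itself does not trigger the failure, it only exercises the code path under discussion when another session updates the same key concurrently:

CREATE TABLE accounts(id integer PRIMARY KEY, balance integer);

BEGIN ISOLATION LEVEL REPEATABLE READ;
-- a concurrent update of the same key is what makes heap_lock_tuple
-- report a conflicting tuple whose visibility is then re-checked
INSERT INTO accounts VALUES (1, 100)
    ON CONFLICT (id) DO UPDATE SET balance = accounts.balance + 100;
COMMIT;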
Re: [HACKERS] On conflict update & hint bits
On 30.09.2016 19:37, Peter Geoghegan wrote:
On Fri, Sep 30, 2016 at 5:33 PM, Konstantin Knizhnik wrote: Later we try to check tuple visibility: ExecCheckHeapTupleVisible(estate, &tuple, buffer); and inside HeapTupleSatisfiesMVCC try to set hint bit.
So, you're using repeatable read or serializable isolation level?

Repeatable read.

-- Konstantin Knizhnik Postgres Professional: http://www.postgrespro.com The Russian Postgres Company
Re: [HACKERS] On conflict update & hint bits
On 24.10.2016 00:49, Peter Geoghegan wrote:
On Sun, Oct 23, 2016 at 2:46 PM, Tom Lane wrote: What's bothering me is (a) it's less than 24 hours to release wrap and (b) this patch changes SSI-relevant behavior and hasn't been approved by Kevin. I'm not familiar enough with that logic to commit a change in it on my own authority, especially with so little time for problems to be uncovered.
I should point out that I knew that the next set of point releases had been moved forward much later than you did. I had to work on this fix during the week, which was pretty far from ideal for me for my own reasons.

Just for information: I know that you are working on this issue, but since we need to proceed with our testing of multimaster, I have made the following obvious changes, and they fix the problem (at least this assertion failure does not happen any more):

src/backend/executor/nodeModifyTable.c
@@ -1087,6 +1087,13 @@ ExecOnConflictUpdate(ModifyTableState *mtstate,
 	test = heap_lock_tuple(relation, &tuple, estate->es_output_cid,
 						   lockmode, LockWaitBlock, false, &buffer,
 						   &hufd);
+
+	/*
+	 * We must hold share lock on the buffer content while examining tuple
+	 * visibility.  Afterwards, however, the tuples we have found to be
+	 * visible are guaranteed good as long as we hold the buffer pin.
+	 */
+	LockBuffer(buffer, BUFFER_LOCK_SHARE);
+
 	switch (test)
 	{
 		case HeapTupleMayBeUpdated:
@@ -1142,6 +1149,7 @@ ExecOnConflictUpdate(ModifyTableState *mtstate,
 		 * loop here, as the new version of the row might not conflict
 		 * anymore, or the conflicting tuple has actually been deleted.
 		 */
+		LockBuffer(buffer, BUFFER_LOCK_UNLOCK);
 		ReleaseBuffer(buffer);
 		return false;
@@ -1175,6 +1183,8 @@ ExecOnConflictUpdate(ModifyTableState *mtstate,
 	/* Store target's existing tuple in the state's dedicated slot */
 	ExecStoreTuple(&tuple, mtstate->mt_existing, buffer, false);
 
+	LockBuffer(buffer, BUFFER_LOCK_UNLOCK);
+
 	/*
 	 * Make tuple and any needed join variables available to ExecQual and
 	 * ExecProject. The EXCLUDED tuple is installed in ecxt_innertuple, while

-- Konstantin Knizhnik Postgres Professional: http://www.postgrespro.com The Russian Postgres Company
[HACKERS] Unlogged tables cleanup
Hi, hackers

I wonder if the following behavior can be considered a bug:

knizhnik@knizhnik:~/dtm-data$ psql postgres
psql (10devel)
Type "help" for help.
postgres=# create tablespace fs location '/home/knizhnik/dtm-data/fs';
CREATE TABLESPACE
postgres=# set default_tablespace=fs;
SET
postgres=# create unlogged table foo(x integer);
CREATE TABLE
postgres=# insert into foo values(generate_series(1,10));
INSERT 0 10

Now simulate a server crash using "pkill -9 postgres".

knizhnik@knizhnik:~/dtm-data$ rm -f logfile ; pg_ctl -D pgsql.master -l logfile start
pg_ctl: another server might be running; trying to start server anyway
server starting
knizhnik@knizhnik:~/dtm-data$ psql postgres
psql (10devel)
Type "help" for help.
postgres=# select * from foo;
ERROR:  could not open file "pg_tblspc/16384/PG_10_201611041/12289/16385": No such file or directory

knizhnik@knizhnik:~/dtm-data$ ls fs
PG_10_201611041
knizhnik@knizhnik:~/dtm-data$ ls fs/PG_10_201611041/

So the whole relation directory has been removed! It happens only for the first table created in the tablespace. If you create the table in the Postgres data directory everything is ok: the first segment of the relation is truncated but not deleted. Also, if you create one more unlogged table in the tablespace, it is truncated correctly:

postgres=# set default_tablespace=fs;
SET
postgres=# create unlogged table foo1(x integer);
CREATE TABLE
postgres=# insert into foo1 values(generate_series(1,10));
INSERT 0 10
postgres=# \q
knizhnik@knizhnik:~/dtm-data$ pkill -9 postgres
knizhnik@knizhnik:~/dtm-data$ rm -f logfile ; pg_ctl -D pgsql.master -l logfile start
pg_ctl: another server might be running; trying to start server anyway
server starting
knizhnik@knizhnik:~/dtm-data$ psql postgres
psql (10devel)
Type "help" for help.
postgres=# select * from foo1;
 x
---
(0 rows)

knizhnik@knizhnik:~/dtm-data$ ls -l fs/PG_10_201611041/12289/*
-rw------- 1 knizhnik knizhnik 0 Nov  9 19:52 fs/PG_10_201611041/12289/32768
-rw------- 1 knizhnik knizhnik 0 Nov  9 19:52 fs/PG_10_201611041/12289/32768_init

-- Konstantin Knizhnik Postgres Professional: http://www.postgrespro.com The Russian Postgres Company
Re: [HACKERS] Unlogged tables cleanup
On Nov 10, 2016, at 10:17 AM, Michael Paquier wrote:
> Hm.. I cannot reproduce what you see on Linux or macos. Perhaps you
> have locally a standby pointing as well to this tablespace?

No, it is the latest sources from the Postgres repository. Please note that you should create a new database and tablespace to reproduce this issue. So the whole sequence is:

mkdir fs
initdb -D pgsql
pg_ctl -D pgsql -l logfile start
psql postgres
# create tablespace fs location '/home/knizhnik/dtm-data/fs';
# set default_tablespace=fs;
# create unlogged table foo(x integer);
# insert into foo values(generate_series(1,10));
# ^D
pkill -9 postgres
pg_ctl -D pgsql -l logfile start
# select * from foo;

> --
> Michael
[HACKERS] Two questions about Postgres parser
Hi hackers,

Working on a vectorized extension for Postgres (VOPS) I ran into two things in the Postgres parser which break my expectations and force me to abandon my original implementation plan. I wonder if it is really intended and correct that:

1. A moving-aggregate implementation should return the same type as the plain implementation. Yes, in most cases it is hard to find arguments why they should return different types. But it is not true for vectorized operations...

2. Implicit user-defined type casts are not applied for the COALESCE operator:

create type complex as (x float8, y float8);

create function float2complex(x float8) returns complex as $$
declare
    c complex;
begin
    c.x := x;
    c.y := 0;
    return c;
end;
$$ language plpgsql strict immutable;

create cast (float8 as complex) with function float2complex(float8) as implicit;

create table foo(c complex);

select coalesce(c, 0.0) from foo;
ERROR:  COALESCE types complex and numeric cannot be matched
LINE 1: select coalesce(c, 0.0) from foo;

select coalesce(c, 0.0::float8) from foo;
ERROR:  COALESCE types complex and double precision cannot be matched
LINE 1: select coalesce(c, 0.0::float8) from foo;

select coalesce(c, 0.0::float8::complex) from foo;
 coalesce
----------
(0 rows)

-- Konstantin Knizhnik Postgres Professional: http://www.postgrespro.com The Russian Postgres Company
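For reference, the restriction in point 1 shows up in the moving-aggregate variant of CREATE AGGREGATE: the moving implementation may have its own transition type, but the final result type stays that of the plain implementation. A minimal sketch using the built-in float8pl/float8mi functions (the aggregate name is made up):

CREATE AGGREGATE my_sum(float8) (
    sfunc    = float8pl,   -- plain transition function
    stype    = float8,
    msfunc   = float8pl,   -- moving-aggregate (window) transition function
    minvfunc = float8mi,   -- inverse transition function for frame removal
    mstype   = float8      -- may differ from stype, but then an mfinalfunc
                           -- must map it back to the aggregate's result type
);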
Re: [HACKERS] WIP: Faster Expression Processing v4
On 13.03.2017 11:03, Andres Freund wrote:
Hi, On 2017-03-12 05:40:51 +0100, Tomas Vondra wrote: I wanted to do a bit of testing and benchmarking on this, but 0004 seems to be a bit broken. Well, "broken" in the sense that it's already outdated, because other stuff that got merged. The patch does not apply anymore - there are some conflicts in execQual.c, but I think I fixed those. But then I ran into a bunch of compile-time errors, because some of the executor nodes still reference bits that were moved elsewhere. Updated patch attached. Note that this patch has two changes I've not yet evaluated performance-wise.

I got the following results on my system with an Intel(R) Core(TM) i7-4770 CPU @ 3.40GHz and 16Gb RAM, TPC-H Q1/Q6 at scale 10, shared_buffers=8Gb, pg_prewarm on the lineitem table projection:

                  Q1        Q6
 Master           7503 ms   1171 ms
 Your patch       6420 ms   1034 ms
 VOPS              396 ms    249 ms
 VOPS + patch      367 ms    233 ms

-- Konstantin Knizhnik Postgres Professional: http://www.postgrespro.com The Russian Postgres Company
[HACKERS] Parallel query execution with SPI
Hi hackers,

Is it possible to execute a query in parallel using SPI? If so, how can it be enforced? I tried to open a cursor with the CURSOR_OPT_PARALLEL_OK flag but it doesn't help: the query is executed by a single backend, while the same query launched at top level uses a parallel plan:

fsstate->portal = SPI_cursor_open_with_args(NULL, fsstate->query,
                                            fsstate->numParams, argtypes, values, nulls,
                                            true, CURSOR_OPT_PARALLEL_OK);
...
SPI_cursor_fetch(fsstate->portal, true, 1);

Thanks in advance,

-- Konstantin Knizhnik Postgres Professional: http://www.postgrespro.com The Russian Postgres Company
Re: [HACKERS] Parallel query execution with SPI
On 31.03.2017 13:48, Robert Haas wrote:
On Fri, Mar 31, 2017 at 3:33 AM, Konstantin Knizhnik wrote: Is it possible to execute a query in parallel using SPI? If so, how can it be enforced? I tried to open a cursor with the CURSOR_OPT_PARALLEL_OK flag but it doesn't help: the query is executed by a single backend while the same query launched at top level uses a parallel plan.
Parallel execution isn't possible if you are using a cursor-type interface, because a parallel query can't be suspended and resumed like a non-parallel query. If you use a function that executes the query to completion in one go, like SPI_execute_plan, then it's cool. See also commit 61c2e1a95f94bb904953a6281ce17a18ac38ee6d.

Thank you very much for the explanation. With SPI_execute the query is indeed executed in parallel. But it means that when I execute some query using SPI, I need to somehow predict the number of returned tuples. If it is not too large, then it is better to use SPI_execute to allow parallel execution of the query. But if it is large, then SPI_execute without a limit can cause memory overflow. Certainly I can specify some reasonable limit and, if it is reached, fall back to a cursor. But that is neither convenient nor efficient. I wonder if somebody can suggest a better solution?

-- Konstantin Knizhnik Postgres Professional: http://www.postgrespro.com The Russian Postgres Company
[HACKERS] FDW and parallel execution
   ->  Hash Join  (cost=20.05..40.79 rows=15 width=36)
         Hash Cond: (nation.n_regionkey = region.r_regionkey)
         ->  Seq Scan on nation  (cost=0.00..17.70 rows=770 width=40)
         ->  Hash  (cost=20.00..20.00 rows=4 width=4)
               ->  Seq Scan on region  (cost=0.00..20.00 rows=4 width=4)
                     Filter: ((r_name)::text = 'ASIA'::text)
   ->  Hash  (cost=294718.76..294718.76 rows=2284376 width=8)
         ->  Foreign Scan on orders_fdw  (cost=0.00..294718.76 rows=2284376 width=8)
   ->  Hash  (cost=32605.64..32605.64 rows=1500032 width=8)
         ->  Foreign Scan on customer_fdw  (cost=0.00..32605.64 rows=1500032 width=8)

The plans look very similar, but the first one is parallel and the second is not. My FDW provides an implementation of IsForeignScanParallelSafe which returns true. I wonder what can prevent the optimizer from using a parallel plan in this case?

Thanks in advance,

-- Konstantin Knizhnik Postgres Professional: http://www.postgrespro.com The Russian Postgres Company
Re: [HACKERS] FDW and parallel execution
On 04.04.2017 13:29, Kyotaro HORIGUCHI wrote:
Hi, At Sun, 02 Apr 2017 16:30:24 +0300, Konstantin Knizhnik wrote in <58e0fcf0.2070...@postgrespro.ru>: Hi hackers and personally Robert (you are the best expert in both areas). I want to ask one more question concerning parallel execution and FDW. Below are two plans for the same query (TPC-H Q5): one for normal tables, another for an FDW to a vertical representation of the same data. The FDW supports the analyze function and is expected to produce similar statistics as for the original tables. The plans look very similar, but the first one is parallel and the second is not. My FDW provides an implementation of IsForeignScanParallelSafe which returns true. I wonder what can prevent the optimizer from using a parallel plan in this case?
Parallel execution requires partial paths. It's the work for GetForeignPaths of your FDW.

Thank you very much for the explanation. But unfortunately I still do not completely understand what kind of queries allow parallel execution with an FDW. The section "FDW Routines for Parallel Execution" of the FDW documentation says:

A ForeignScan node can, optionally, support parallel execution. A parallel ForeignScan will be executed in multiple processes and should return each row only once across all cooperating processes. To do this, processes can coordinate through fixed size chunks of dynamic shared memory. This shared memory is not guaranteed to be mapped at the same address in every process, so pointers may not be used. The following callbacks are all optional in general, but required if parallel execution is to be supported.

I provided IsForeignScanParallelSafe, EstimateDSMForeignScan, InitializeDSMForeignScan and InitializeWorkerForeignScan in my FDW. IsForeignScanParallelSafe returns true. Also, in the GetForeignPaths function I created a path with baserel->consider_parallel == true. Is that enough, or should I do something else? But unfortunately I failed to find any query (sequential scan, grand aggregate, aggregation with group by, joins...) for which a parallel execution plan is used with this FDW. Also there are no examples of using these functions in the Postgres distribution, and I failed to find any such examples on the Internet.

Can somebody please clarify the situation with parallel execution and FDWs, and maybe point me at some examples? Thanks in advance.

-- Konstantin Knizhnik Postgres Professional: http://www.postgrespro.com The Russian Postgres Company
Re: [HACKERS] ASOF join
On 16.06.2017 19:07, David Fetter wrote:
On Fri, Jun 16, 2017 at 11:51:34AM +1200, Thomas Munro wrote:
On Fri, Jun 16, 2017 at 4:20 AM, Konstantin Knizhnik wrote: I wonder if there were some discussions/attempts to add ASOF join to Postgres (sorry, maybe there is a better term for it; I am referring to the KDB definition: http://code.kx.com/wiki/Reference/aj ).
Interesting idea. Also in Pandas: http://pandas.pydata.org/pandas-docs/version/0.19.0/generated/pandas.merge_asof.html#pandas.merge_asof
I attached a simple patch adding ASOF join to Postgres. Right now it supports only outer join and requires a USING clause (consequently it is not possible to join two tables whose join keys have different names). Maybe it is also possible to support an ON clause with a condition written like o.k1 = i.k1 AND o.k2 = i.k2 AND ... AND o.kN >= i.kN, but such notation can be confusing, because the join result includes only one matching inner record with kN smaller than or equal to the kN of the outer record, and not all such records. As an alternative we can add special syntax for it. If people find such a construction really useful, I will continue work on it.
I suppose you could write a function that pulls tuples out of a bunch of cursors and zips them together like this, as a kind of hand-coded special merge join "except that we match on nearest key rather than equal keys" (as they put it). I've written code like this before in a trading context, where we called that 'previous tick interpolation', and in a scientific context where other kinds of interpolation were called for (so not really matching a tuple but synthesising one if no exact match). If you view the former case as a kind of degenerate case of interpolation then it doesn't feel like a "join" as we know it, but clearly it is. I had never considered before that such things might belong inside the database as a kind of join operator.
If you turn your head sideways, it's very similar to the range merge join Jeff Davis proposed. https://commitfest.postgresql.org/14/1106/

Maybe, but I do not understand how to limit the result to contain exactly one (the last) inner tuple for each outer tuple.

-- Konstantin Knizhnik Postgres Professional: http://www.postgrespro.com The Russian Postgres Company

diff --git a/contrib/postgres_fdw/deparse.c b/contrib/postgres_fdw/deparse.c
index 482a3dd..f7a8f38 100644
--- a/contrib/postgres_fdw/deparse.c
+++ b/contrib/postgres_fdw/deparse.c
@@ -1324,6 +1324,9 @@ get_jointype_name(JoinType jointype)
 		case JOIN_FULL:
 			return "FULL";
 
+		case JOIN_ASOF:
+			return "ASOF";
+
 		default:
 			/* Shouldn't come here, but protect from buggy code. */
 			elog(ERROR, "unsupported join type %d", jointype);
diff --git a/contrib/postgres_fdw/postgres_fdw.c b/contrib/postgres_fdw/postgres_fdw.c
index 080cb0a..54cf6c1 100644
--- a/contrib/postgres_fdw/postgres_fdw.c
+++ b/contrib/postgres_fdw/postgres_fdw.c
@@ -4073,7 +4073,7 @@ foreign_join_ok(PlannerInfo *root, RelOptInfo *joinrel, JoinType jointype,
 	 * Constructing queries representing SEMI and ANTI joins is hard, hence
 	 * not considered right now.
 	 */
-	if (jointype != JOIN_INNER && jointype != JOIN_LEFT &&
+	if (jointype != JOIN_INNER && jointype != JOIN_LEFT && jointype != JOIN_ASOF &&
 		jointype != JOIN_RIGHT && jointype != JOIN_FULL)
 		return false;
@@ -4211,6 +4211,7 @@ foreign_join_ok(PlannerInfo *root, RelOptInfo *joinrel, JoinType jointype,
 			break;
 
 		case JOIN_LEFT:
+		case JOIN_ASOF:
 			fpinfo->joinclauses = list_concat(fpinfo->joinclauses,
 											  list_copy(fpinfo_i->remote_conds));
 			fpinfo->remote_conds = list_concat(fpinfo->remote_conds,
diff --git a/doc/src/sgml/ref/select.sgml b/doc/src/sgml/ref/select.sgml
index 211e4c3..fd3be8c 100644
--- a/doc/src/sgml/ref/select.sgml
+++ b/doc/src/sgml/ref/select.sgml
@@ -514,6 +514,9 @@ TABLE [ ONLY ] table_name [ * ]
     CROSS JOIN
+
+    ASOF [ OUTER ] JOIN
+
 For the INNER and OUTER join types, a
@@ -523,7 +526,9 @@ TABLE [ ONLY ] table_name [ * ]
 USING (join_column [, ...]). See below for the meaning. For CROSS JOIN,
-none of these clauses can appear.
+none of these clauses can appear. For ASOF join type, a
+join condition must be USING (join_column [, ...]).
@@ -571,6 +576,32 @@ TABLE [ ONLY ] table_name [ * ]
 on the right), plus one row for each unmatched right-hand row (extended with nulls on the left).
+
+ ASOF OUTER JOIN is similar to LEFT OUTER JOIN but it accepts only
+USING (join_column_1 [, ...], join_column_N) clause
+where last joined column join_column_N is expected to be timestamp
+(but actually can have any comparable type) and outer tuple is matched with only one inner tuple with the s
Re: [HACKERS] ASOF join
On 21.06.2017 11:00, Thomas Munro wrote:
Hmm. Yeah, I see the notational problem. It's hard to come up with a new syntax that has SQL nature. What if... we didn't use a new syntax at all, but recognised existing queries that are executable with this strategy? Queries like this:

WITH ticks(time, price) AS
       (VALUES ('2017-07-20 12:00:00'::timestamptz, 100.00),
               ('2017-07-21 11:00:00'::timestamptz, 150.00)),
     times(time) AS
       (VALUES ('2017-07-19 12:00:00'::timestamptz),
               ('2017-07-20 12:00:00'::timestamptz),
               ('2017-07-21 12:00:00'::timestamptz),
               ('2017-07-22 12:00:00'::timestamptz))
SELECT times.time, previous_tick.price
FROM times
LEFT JOIN LATERAL (SELECT * FROM ticks
                   WHERE ticks.time <= times.time
                   ORDER BY ticks.time DESC LIMIT 1) previous_tick ON true
ORDER BY times.time;

          time          | price
------------------------+--------
 2017-07-19 12:00:00+12 |
 2017-07-20 12:00:00+12 | 100.00
 2017-07-21 12:00:00+12 | 150.00
 2017-07-22 12:00:00+12 | 150.00
(4 rows)

I haven't used LATERAL much myself but I've noticed that it's often used to express this type of thing. "Get me the latest ... as of time ...". It'd be a bit like the way we recognise EXISTS (...) as a semi-join and execute it with a join operator instead of having a SEMI JOIN syntax. On the other hand it's a bit more long winded, extreme and probably quite niche.

Thank you for this idea. I agree that this is the best way of implementing ASOF join: just as an optimization of a standard SQL query. But do you think it would still be a good idea to extend the SQL syntax with an ASOF JOIN ... USING ... clause? It would significantly simplify writing queries like the one above and IMHO doesn't introduce any confusion with standard SQL syntax. My primary motivation for suggesting ASOF join for Postgres was not just building a more efficient plan (using a merge join instead of a nested loop) but also simplifying the writing of such queries. Or do you think that nobody will be interested in non-standard SQL extensions?

-- Konstantin Knizhnik Postgres Professional: http://www.postgrespro.com The Russian Postgres Company
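For illustration, under the proposed non-standard syntax from the attached patch the LATERAL query above could presumably be written roughly as follows (a sketch only, reusing the ticks/times relations from the example; the exact semantics are whatever the patch implements):

-- for each row of "times", return the single "ticks" row with the
-- greatest ticks.time <= times.time
SELECT time, price
FROM times ASOF JOIN ticks USING (time)
ORDER BY time;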
Re: [HACKERS] Error : undefined symbol : LWLockAssign in 9.6.3
On 08/09/2017 07:07 AM, 송기훈 wrote:
Hi. I'm trying to use the imcs module with 9.6 and got this error message: the LWLockAssign function has been deleted from 9.6, so I can't use this module anymore starting with 9.6. What I want to ask is whether your team has decided not to support the imcs module anymore, or does not care about the imcs module, or whether there are any other ways to run PostgreSQL in memory only?

Hi, I am the author of the IMCS module and I maintain it. Please contact me directly. I have committed a patch to the https://github.com/knizhnik/imcs.git repository which allows IMCS to be used with 9.6.3 and later Postgres versions.

-- Konstantin Knizhnik Postgres Professional: http://www.postgrespro.com The Russian Postgres Company
[HACKERS] Secondary index access optimizations
Hi hackers,

I am trying to compare different ways of optimizing work with huge append-only tables in PostgreSQL, where the primary key is something like a timestamp and queries usually access the most recent data using some secondary keys. The size of the secondary index is one of the most critical factors limiting insert/search performance. As far as data is inserted in timestamp-ascending order, access to the primary key is well localized and the accessed part of the table is present in memory. But if we create a secondary key for the whole table, then access to it will require random reads from the disk and significantly decrease performance.

There are two well known solutions to the problem:
1. Table partitioning
2. Partial indexes

These are the approaches I want to compare. First of all I want to check whether the optimizer is able to generate an efficient query execution plan covering different time intervals. Unfortunately, in both cases the generated plan is not optimal.

1. Table partitioning:

create table base (k integer primary key, v integer);
create table part1 (check (k between 1 and 10000)) inherits (base);
create table part2 (check (k between 10001 and 20000)) inherits (base);
create index pi1 on part1(v);
create index pi2 on part2(v);
insert into part1 values (generate_series(1,10000), random());
insert into part2 values (generate_series(10001,20000), random());
explain select * from base where k between 1 and 20000 and v = 100;
                              QUERY PLAN
-----------------------------------------------------------------------
 Append  (cost=0.00..15.65 rows=3 width=8)
   ->  Seq Scan on base  (cost=0.00..0.00 rows=1 width=8)
         Filter: ((k >= 1) AND (k <= 20000) AND (v = 100))
   ->  Index Scan using pi1 on part1  (cost=0.29..8.31 rows=1 width=8)
         Index Cond: (v = 100)
         Filter: ((k >= 1) AND (k <= 20000))
   ->  Index Scan using pi2 on part2  (cost=0.29..7.34 rows=1 width=8)
         Index Cond: (v = 100)
         Filter: ((k >= 1) AND (k <= 20000))

Questions:
- Is there some way to avoid the sequential scan of the parent table? Yes, it is empty and so the sequential scan will not take much time, but... it still requires some additional actions and so increases query execution time.
- Why does the index scan of the partition indexes include the filter condition, if the check constraint guarantees that all records of this partition match the search predicate?

2. Partial indexes:

create table t (k integer primary key, v integer);
insert into t values (generate_series(1,20000), random());
create index i1 on t(v) where k between 1 and 10000;
create index i2 on t(v) where k between 10001 and 20000;

postgres=# explain select * from t where k between 1 and 10000 and v = 100;
                         QUERY PLAN
-------------------------------------------------------------
 Index Scan using i1 on t  (cost=0.29..7.28 rows=1 width=8)
   Index Cond: (v = 100)
(2 rows)

Here we get the perfect plan. Let's try to extend the search interval:

postgres=# explain select * from t where k between 1 and 20000 and v = 100;
                            QUERY PLAN
-------------------------------------------------------------------
 Index Scan using t_pkey on t  (cost=0.29..760.43 rows=1 width=8)
   Index Cond: ((k >= 1) AND (k <= 20000))
   Filter: (v = 100)
(3 rows)

Unfortunately in this case Postgres is not able to apply the partial indexes. And this is what I expected to get:

postgres=# explain select * from t where k between 1 and 10000 and v = 100
           union all
           select * from t where k between 10001 and 20000 and v = 100;
                             QUERY PLAN
---------------------------------------------------------------------
 Append  (cost=0.29..14.58 rows=2 width=8)
   ->  Index Scan using i1 on t  (cost=0.29..7.28 rows=1 width=8)
         Index Cond: (v = 100)
   ->  Index Scan using i2 on t t_1  (cost=0.29..7.28 rows=1 width=8)
         Index Cond: (v = 100)

I wonder if there are some fundamental problems with supporting these two things in the optimizer:
1. Remove a search condition on the primary key if it is fully satisfied by the derived table's check constraint.
2. Append index scans of several partial indexes if the specified interval is covered by their conditions.

I wonder if someone who is familiar with this part of the optimizer can easily fix it. Otherwise I am going to spend some time on solving these problems (if the community thinks that such optimizations would be useful).

-- Konstantin Knizhnik Postgres Professional: http://www.postgrespro.com The Russian Postgres Company
Re: [HACKERS] Secondary index access optimizations
On 14.08.2017 12:37, Konstantin Knizhnik wrote:
I wonder if there are some fundamental problems with supporting these two things in the optimizer:
1. Remove a search condition on the primary key if it is fully satisfied by the derived table's check constraint.
2. Append index scans of several partial indexes if the specified interval is covered by their conditions.

Replying to myself: the following small patch removes redundant checks from index scans for derived tables:

diff --git a/src/backend/optimizer/util/plancat.c b/src/backend/optimizer/util/plancat.c
index 939045d..1f7c9cf 100644
--- a/src/backend/optimizer/util/plancat.c
+++ b/src/backend/optimizer/util/plancat.c
@@ -1441,6 +1441,20 @@ relation_excluded_by_constraints(PlannerInfo *root,
 	if (predicate_refuted_by(safe_constraints, rel->baseres
Re: [HACKERS] Secondary index access optimizations
On 14.08.2017 19:33, Konstantin Knizhnik wrote:
Replying to myself: the following small patch removes redundant checks from index scans for derived tables:

diff --git a/src/backend/optimizer/util/plancat.c b/src/backend/optimizer/util/plancat.c
index 939045d..1f7c9cf 100644
--- a/src/backend/optimizer/util/plancat.c
+++ b/src/backend/optimizer/util/plancat.c
@@ -1441,6 +1441,20 @@ relation_excluded_by_constraints(PlannerInfo *root,
 	if (predicate_refuted
Re: [HACKERS] [PATCH] Push limit to sort through a subquery
On 29.04.2017 00:13, Douglas Doole wrote:
If you add this to the commitfest app, more people might look at it when the next commitfest opens. I have added it. https://commitfest.postgresql.org/14/1119/ Also, it might help if you can provide a query/ies with numbers where this optimization shows improvement. I can't provide the real queries where we encountered the problem because they are internal. However I showed a simplified version of the queries in my first post. On our queries, the change made quite a difference - execution time dropped from 31.4 seconds to 7.2 seconds. Explain analyze also shows that memory use dropped significantly and we didn't have to spill the sort to disk.

From:
->  Sort  (cost=989.95..1013.27 rows=9326 width=30) (node_startup_time/loop=31328.891, node_total_time/loop: 31329.756 rows=2001 loops=1)
      Buffers: temp read=772 written=11201 lsm_bufmgr hits=3392
      Sort Key: ***
      Sort Method: external merge  Sort Space Used: 89592  Sort Space Type: Disk

To:
->  Sort  (cost=989.95..1013.27 rows=9326 width=30) (node_startup_time/loop=7123.275, node_total_time/loop: 7123.504 rows=2001 loops=1)
      Buffers: lsm_bufmgr hits=3387
      Sort Key: ***
      Sort Method: top-N heapsort  Sort Space Used: 3256  Sort Space Type: Memory

Attached please find yet another small patch which pushes down LIMIT to ForeignScan. I should note that currently the Postgres optimizer uses "Merge Append" and fetches only the required number of tuples from the remote nodes, so even without LIMIT push-down postgres_fdw will not pull the whole table from the remote host. postgres_fdw uses a cursor for fetching data from the remote server. The default fetch size is 100, so even without a limit the remote query will fetch no more than 100 rows at the remote site.

Assume the following example:

postgres=# create extension postgres_fdw;
postgres=# create server shard1 FOREIGN DATA WRAPPER postgres_fdw options(dbname 'postgres', host 'localhost', port '5432');
postgres=# create server shard2 FOREIGN DATA WRAPPER postgres_fdw options(dbname 'postgres', host 'localhost', port '5432');
postgres=# CREATE USER MAPPING for $user SERVER shard1 options (user '$user');
postgres=# CREATE USER MAPPING for $user SERVER shard2 options (user '$user');
postgres=# CREATE TABLE t(u integer primary key, v integer);
postgres=# CREATE TABLE t1(u integer primary key, v integer);
postgres=# CREATE TABLE t2(u integer primary key, v integer);
postgres=# insert into t1 values (generate_series(1,10), random()*10);
postgres=# insert into t2 values (generate_series(1,10), random()*10);
postgres=# CREATE FOREIGN TABLE t_fdw1() inherits (t) server shard1 options(table_name 't1');
postgres=# CREATE FOREIGN TABLE t_fdw2() inherits (t) server shard2 options(table_name 't2');

postgres=# explain analyze select * from t order by u limit 1;
                                        QUERY PLAN
-------------------------------------------------------------------------------------------
 Limit  (cost=200.15..200.20 rows=1 width=8) (actual time=2.010..2.010 rows=1 loops=1)
   ->  Merge Append  (cost=200.15..449.39 rows=5121 width=8) (actual time=2.009..2.009 rows=1 loops=1)
         Sort Key: t.u
         ->  Index Scan using t_pkey on t  (cost=0.12..8.14 rows=1 width=8) (actual time=0.005..0.005 rows=0 loops=1)
         ->  Foreign Scan on t_fdw2  (cost=100.00..193.92 rows=2560 width=8) (actual time=1.074..1.074 rows=1 loops=1)
         ->  Foreign Scan on t_fdw1  (cost=100.00..193.92 rows=2560 width=8) (actual time=0.928..0.928 rows=1 loops=1)
 Planning time: 0.769 ms
 Execution time: 6.837 ms
(8 rows)

As you can see, the foreign scan fetches only one row from each remote node.
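As a side note, the 100-row cursor fetch size mentioned above is just the default and is configurable in postgres_fdw through the fetch_size option, for example (using the server and foreign table names from the example above):

ALTER SERVER shard1 OPTIONS (ADD fetch_size '1000');
-- or per foreign table:
ALTER FOREIGN TABLE t_fdw1 OPTIONS (ADD fetch_size '1000');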
But still, pushing down the limit can have a positive effect on performance, especially if the Sort can be replaced with a top-N sort. I got the following results (times in seconds):

 Query                                 original   limit push-down
 select * from t order by u limit 1    2.276      1.777
 select * from t order by v limit 1    100        42

There is an index on "u", so fetching the records with the smallest "u" values can be done without sorting, and the times are similar. But in the case of sorting by "v", pushing down the limit allows a top-1 sort to be used instead of a global sort, which reduces query execution time by more than a factor of two.

-- Konstantin Knizhnik Postgres Professional: http://www.postgrespro.com The Russian Postgres Company

diff --git a/contrib/postgres_fdw/postgres_fdw.c b/contrib/postgres_fdw/postgres_fdw.c
index 080cb0a..e3847ce 100644
--- a/contrib/postgres_fdw/postgres_fdw.c
+++ b/contrib/postgres_fdw/postgres_fdw.c
@@ -2949,7 +2949,8 @@ create_cursor(ForeignScanState *node)
 	initStringInfo(&buf);
 	appendStringInfo(&buf, "DECLARE c%u CURSOR FOR\n%s",
[HACKERS] Cached plans and statement generalization
Hi hackers,

There have been a lot of discussions about query plan caching on the hackers mailing list, but I failed to find a clear answer to my question and the current consensus of the Postgres community on it. As far as I understand, the current state is the following:
1. We have per-connection prepared statements.
2. Queries executed inside plpgsql code are implicitly prepared.

It is not always possible or convenient to use prepared statements, for example when pgbouncer is used to perform connection pooling. Another use case (which is actually the problem I am trying to solve now) is partitioning. Efficient execution of a query against a partitioned table requires a hardcoded value for the partitioning key; only in this case is the optimizer able to construct an efficient query plan which accesses only the affected tables (partitions). My small benchmark for a distributed partitioned table based on pg_pathman + postgres_fdw shows a threefold performance degradation when prepared statements are used. But without prepared statements a substantial amount of time is spent on query parsing and planning. I was able to speed the benchmark up more than two times by sending prepared queries directly to the remote nodes.

So what I am thinking about now is implicit query caching. If the same query with different literal values is repeated many times, we can try to generalize this query and replace it with a prepared query with parameters. I am not considering a shared query cache for now: it seems to be much harder to implement. But local caching of generalized queries seems not so difficult to implement and does not require many changes in the Postgres code. And it can be useful not only for sharding, but for many other cases where prepared statements cannot be used.

I wonder whether such an option was already considered and, if it was rejected for some reason, can you point me to those reasons?

-- Konstantin Knizhnik Postgres Professional: http://www.postgrespro.com The Russian Postgres Company
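To make the partitioning argument concrete, here is a minimal sketch (the table and partition names are invented for illustration) showing how a hardcoded literal lets 9.6-era constraint exclusion scan a single child of an inheritance-partitioned table, while a prepared statement loses that benefit once the plan cache switches to a generic plan:

CREATE TABLE measurement(id int, logdate date);
CREATE TABLE measurement_y2017m01
    (CHECK (logdate >= '2017-01-01' AND logdate < '2017-02-01')) INHERITS (measurement);
CREATE TABLE measurement_y2017m02
    (CHECK (logdate >= '2017-02-01' AND logdate < '2017-03-01')) INHERITS (measurement);
SET constraint_exclusion = partition;

-- A literal lets the planner prove that only one child can contain matching rows.
EXPLAIN SELECT * FROM measurement WHERE logdate = '2017-01-15';

-- With a parameter the generic plan must keep every child: the first few
-- executions may still use custom plans, but after the switch to a generic
-- plan all partitions are scanned.
PREPARE by_day(date) AS SELECT * FROM measurement WHERE logdate = $1;
EXPLAIN EXECUTE by_day('2017-01-15');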
Re: [HACKERS] Cached plans and statement generalization
On 24.04.2017 13:24, Alexander Korotkov wrote:
Hi, Konstantin!
On Mon, Apr 24, 2017 at 11:46 AM, Konstantin Knizhnik <k.knizh...@postgrespro.ru> wrote:
There have been a lot of discussions about query plan caching on the hackers mailing list, but I failed to find a clear answer to my question and the current consensus of the Postgres community on it. As far as I understand, the current state is the following:
1. We have per-connection prepared statements.
2. Queries executed inside plpgsql code are implicitly prepared.
It is not always possible or convenient to use prepared statements, for example when pgbouncer is used to perform connection pooling. Another use case (which is actually the problem I am trying to solve now) is partitioning. Efficient execution of a query against a partitioned table requires a hardcoded value for the partitioning key; only in this case is the optimizer able to construct an efficient query plan which accesses only the affected tables (partitions). My small benchmark for a distributed partitioned table based on pg_pathman + postgres_fdw shows a threefold performance degradation when prepared statements are used. But without prepared statements a substantial amount of time is spent on query parsing and planning. I was able to speed the benchmark up more than two times by sending prepared queries directly to the remote nodes.
I don't think it's correct to ask PostgreSQL hackers about a problem which arises with pg_pathman, while pg_pathman is an extension supported by Postgres Pro. Since we have declarative partitioning committed to 10, I think the community should address this issue in the context of declarative partitioning. However, it's unlikely we can spot this issue with declarative partitioning because it still uses the very inefficient constraint exclusion mechanism. Thus, the issues you are writing about would become visible with declarative partitioning only when constraint exclusion is replaced with something more efficient. Long story short, could you reproduce this issue without pg_pathman?

Sorry, I mentioned pg_pathman just as an example. The same problem occurs with partitioning based on the standard Postgres inheritance mechanism (when I manually create derived tables and specify constraints for them). I have not yet tested the declarative partitioning committed to 10, but I expect that it will also suffer from this problem (because it is based on inheritance). But as I wrote, I think that the problem with plan caching is wider and is not limited to partitioning.

-- Alexander Korotkov Postgres Professional: http://www.postgrespro.com The Russian Postgres Company
-- Konstantin Knizhnik Postgres Professional: http://www.postgrespro.com The Russian Postgres Company
Re: [HACKERS] Cached plans and statement generalization
On 24.04.2017 21:43, Andres Freund wrote:
Hi,
On 2017-04-24 11:46:02 +0300, Konstantin Knizhnik wrote:
So what I am thinking about now is implicit query caching. If the same query with different literal values is repeated many times, we can try to generalize this query and replace it with a prepared query with parameters.
That's actually not all that easy:
- You pretty much have to do parse analysis to be able to do an accurate match. How much overhead is parse analysis vs. planning in your cases?
- The invalidation infrastructure for this, if not tied to fully parse-analyzed statements, is going to be hell.
- Migrating to parameters can actually cause significant slowdowns, not nice if that happens implicitly.

Well, first of all I want to share the results I have already obtained: pgbench with default parameters, scale 10 and one connection:

protocol               TPS
simple                 3492
extended               2927
prepared               6865
simple + autoprepare   6844

So autoprepare is as efficient as explicit prepare and can increase performance almost two times.

My current implementation replaces only string literals in the query with parameters, i.e.

select * from T where x='123';  ->  select * from T where x=$1;

It greatly simplifies the matching of parameters: it is just necessary to locate the '\'' character and then correctly handle pairs of quotes. Handling integer and real literals is a really challenging task. One source of problems is negation: it is not so easy to decide correctly whether a minus should be treated as part of the literal or as an operator: (-1), (1-1), (1-1)-1. Another problem is caused by integer literals used in contexts where parameters cannot be used, for example "order by 1". Fully correct substitution can be done by first parsing the query, then transforming the parse tree, replacing literal nodes with parameter nodes, and finally deparsing the tree into a generalized query. postgres_fdw already contains such deparse code; it could be moved to the Postgres core and reused for autoprepare (and maybe elsewhere). But in this case the overhead will be much higher. I still think that query parsing time is significantly smaller than the time needed for building and optimizing the query execution plan, but it should be measured if the community is interested in such an approach.

There is an obvious question: how did I manage to get these pgbench results if currently only substitution of string literals is supported and the queries constructed by pgbench don't contain string literals? I just made a small patch to pgbench's replaceVariable method, wrapping the value's representation in quotes. It has almost no impact on performance (3482 TPS vs. 3492 TPS), but it allows autoprepare to deal with pgbench queries.

I attached my patch to this mail. It is just the first version of the patch (based on the REL9_6_STABLE branch), intended to illustrate the proposed approach. I will be glad to receive any comments, and if such an optimization is considered useful, I will continue work on this patch.
-- Konstantin Knizhnik Postgres Professional: http://www.postgrespro.com The Russian Postgres Company diff --git a/src/backend/tcop/postgres.c b/src/backend/tcop/postgres.c index f6be98b..6291d66 100644 --- a/src/backend/tcop/postgres.c +++ b/src/backend/tcop/postgres.c @@ -96,8 +96,6 @@ int max_stack_depth = 100; /* wait N seconds to allow attach from a debugger */ int PostAuthDelay = 0; - - /* * private variables * @@ -188,6 +186,7 @@ static bool IsTransactionStmtList(List *parseTrees); static void drop_unnamed_stmt(void); static void SigHupHandler(SIGNAL_ARGS); static void log_disconnections(int code, Datum arg); +static bool exec_cached_query(const char *query_string); /* @@ -916,6 +915,14 @@ exec_simple_query(const char *query_string) drop_unnamed_stmt(); /* + * Try to find cached plan + */ + if (autoprepare_threshold != 0 && exec_cached_query(query_string)) + { + return; + } + + /* * Switch to appropriate context for constructing parsetrees. */ oldcontext = MemoryContextSwitchTo(MessageContext); @@ -4500,3 +4509,566 @@ log_disconnections(int code, Datum arg) port->user_name, port->database_name, port->remote_host, port->remote_port[0] ? " port=" : "", port->remote_port))); } + + + +typedef struct { + char const* query; + int64 exec_count; + CachedPlanSource* plan; + int n_params; + int16 format; +} plan_cache_entry; + + +/* + * Replace string literals with parameters. We do not consider integer or real literals to avoid problems with + * negative number, user defined operators, ... For example it is not easy to distinguish cases (-1), (1-1), (1-1)-1 + */ +static void generalize_statement(const char *query_string, char** gen_query, char** query_params, int* n_params) +{ + size_t
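To spell out what the generalization in the message above amounts to, the following sketch shows the explicit prepared statement a client would otherwise have to create by hand (the table and column are hypothetical and just mirror the SELECT used as an example above):

CREATE TABLE T(x text, y int);
-- A client repeatedly sends queries that differ only in the string literal:
--   SELECT * FROM T WHERE x = '123';
--   SELECT * FROM T WHERE x = '456';
-- Autoprepare generalizes them into one cached statement, conceptually:
PREPARE t_by_x(text) AS SELECT * FROM T WHERE x = $1;
EXECUTE t_by_x('123');
EXECUTE t_by_x('456');
-- The difference is that the server performs this caching transparently,
-- so unmodified applications (e.g. behind pgbouncer) get the benefit.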
Re: [HACKERS] Cached plans and statement generalization
On 25.04.2017 19:12, Serge Rielau wrote:
On Apr 25, 2017, at 8:11 AM, Konstantin Knizhnik <k.knizh...@postgrespro.ru> wrote:
Another problem is caused by using integer literals in contexts where parameters cannot be used, for example "order by 1".
You will also need to deal with modifiers in types such as VARCHAR(10). Not sure if there are specific functions which can only deal with literals (?) as well.
Sorry, I do not completely understand how the presence of type modifiers can affect string literals used in a query. Can you provide me with an example?
Doug Doole did this work in DB2 LUW and he may be able to point to more places to watch out for semantically. Generally, in my experience, this feature is very valuable when dealing with (poorly designed) web apps that just glue together strings.
I do not think that this optimization will be useful only for poorly designed applications. I have already pointed out two use cases where prepared statements cannot be used:
1. pgbouncer without session-level pooling.
2. partitioning.
Protecting it under a GUC would allow the work to be done only if it's deemed likely to help. Another rule I find useful is to abort any effort to substitute literals if any bind variable is found in the query. That can be used as a cue that the author of the SQL left the remaining literals in on purpose. A follow-up feature would be to formalize different flavors of peeking, i.e. can you produce a generic plan, but still recruit the initial set of bind values/substituted literals to do costing?
Here the situation is the same as for explicitly prepared statements, isn't it? Sometimes it is preferable to use a specialized plan rather than a generic plan. I am not sure whether Postgres is currently able to do this.
Cheers Serge Rielau Salesforce.com
PS: FWIW, I like this feature.
-- Konstantin Knizhnik Postgres Professional: http://www.postgrespro.com The Russian Postgres Company
Re: [HACKERS] Cached plans and statement generalization
On 04/25/2017 07:54 PM, David Fetter wrote:
On Tue, Apr 25, 2017 at 06:11:09PM +0300, Konstantin Knizhnik wrote:
On 24.04.2017 21:43, Andres Freund wrote:
Hi,
On 2017-04-24 11:46:02 +0300, Konstantin Knizhnik wrote:
So what I am thinking about now is implicit query caching. If the same query with different literal values is repeated many times, we can try to generalize this query and replace it with a prepared query with parameters.
That's actually not all that easy:
- You pretty much have to do parse analysis to be able to do an accurate match. How much overhead is parse analysis vs. planning in your cases?
- The invalidation infrastructure for this, if not tied to fully parse-analyzed statements, is going to be hell.
- Migrating to parameters can actually cause significant slowdowns, not nice if that happens implicitly.
Well, first of all I want to share the results I have already obtained: pgbench with default parameters, scale 10 and one connection:

protocol               TPS
simple                 3492
extended               2927
prepared               6865
simple + autoprepare   6844

If this is string mashing on the unparsed query, as it appears to be, it's going to be a perennial source of security issues.

Sorry, maybe I have missed something, but I cannot understand how security can be violated by extracting string literals from a query. I am just copying bytes from one buffer to another; I do not try to interpret these parameters in any way. What I am doing is very similar to standard prepared statements. And moreover, the query is parsed! Only a query which was already parsed and executed (but with different values of its parameters) can be autoprepared.

Best, David.
-- Konstantin Knizhnik Postgres Professional: http://www.postgrespro.com The Russian Postgres Company
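To illustrate the point about not interpreting the copied bytes: once a literal has been turned into a parameter, its contents (including quote characters) travel as a bound value rather than as SQL text. A sketch with a hypothetical table, written as the equivalent explicit prepared statement:

CREATE TABLE users(name text);
-- Query text as sent by the client:
--   SELECT * FROM users WHERE name = 'O''Brien';
-- After generalization this becomes, conceptually:
PREPARE users_by_name(text) AS SELECT * FROM users WHERE name = $1;
EXECUTE users_by_name('O''Brien');  -- the parameter value is the string O'Brien;
                                    -- it is never re-parsed as part of the SQL text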
Re: [HACKERS] Cached plans and statement generalization
On 04/25/2017 08:09 PM, Serge Rielau wrote:
On Tue, Apr 25, 2017 at 9:45 AM, Konstantin Knizhnik wrote:
On 25.04.2017 19:12, Serge Rielau wrote:
On Apr 25, 2017, at 8:11 AM, Konstantin Knizhnik <k.knizh...@postgrespro.ru> wrote:
Another problem is caused by using integer literals in contexts where parameters cannot be used, for example "order by 1".
You will also need to deal with modifiers in types such as VARCHAR(10). Not sure if there are specific functions which can only deal with literals (?) as well.
Sorry, I do not completely understand how the presence of type modifiers can affect string literals used in a query. Can you provide me with an example?
SELECT 'hello'::CHAR(10) || 'World', 5 + 6;
You can substitute 'hello', 'World', 5, and 6. But not 10.
I am substituting only string literals. So the query above will be transformed to
SELECT $1::CHAR(10) || $2, 5 + 6;
What's wrong with that?
Also some OLAP syntax like "rows preceding". It pretty much boils down to whether you can do some shallow parsing rather than expending the effort to build the parse tree.
Cheers Serge
-- Konstantin Knizhnik Postgres Professional: http://www.postgrespro.com The Russian Postgres Company
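As a concrete version of Serge's example: written as an explicit prepared statement, every literal operand can become a parameter, but the typmod inside CHAR(10) must stay a literal because the grammar only accepts an integer constant there (the parameter types below are my assumption):

PREPARE gen(text, text, int, int) AS SELECT $1::CHAR(10) || $2, $3 + $4;
EXECUTE gen('hello', 'World', 5, 6);
-- Trying to parameterize the type modifier does not even parse:
--   SELECT 'hello'::CHAR($1);   -- syntax error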
Re: [HACKERS] Cached plans and statement generalization
On 04/25/2017 11:40 PM, Serge Rielau wrote:
On Apr 25, 2017, at 1:37 PM, Konstantin Knizhnik <k.knizh...@postgrespro.ru> wrote:
SELECT 'hello'::CHAR(10) || 'World', 5 + 6;
You can substitute 'hello', 'World', 5, and 6. But not 10.
I am substituting only string literals. So the query above will be transformed to SELECT $1::CHAR(10) || $2, 5 + 6; What's wrong with that?
Oh, well that leaves a lot of opportunities on the table, doesn't it?
Well, actually my primary intention was not to make badly designed programs (which do not use prepared statements) work faster. I wanted to address the cases where it is not possible to use prepared statements at all. If we want to substitute as many literals as possible with parameters, then parsing the query and deparsing the transformed parse tree seems to be the only reasonable approach. I will try to implement it as well, just to estimate the parsing overhead.
Cheers Serge
-- Konstantin Knizhnik Postgres Professional: http://www.postgrespro.com The Russian Postgres Company
Re: [HACKERS] Cached plans and statement generalization
On 26.04.2017 00:47, Andres Freund wrote:
On 2017-04-25 21:11:08, Doug Doole wrote:
When I did this in DB2, I didn't use the parser - it was too expensive. I just tokenized the statement and used some simple rules to bypass the invalid cases. For example, if I saw the tokens "ORDER" and "BY", then I'd disallow replacement until I hit the end of the current subquery or statement.
How did you manage plan invalidation and such?
The same mechanism as for prepared statements is used: cached plans are linked into the list by the SaveCachedPlan function and are invalidated by PlanCacheRelCallback.
- Andres
-- Konstantin Knizhnik Postgres Professional: http://www.postgrespro.com The Russian Postgres Company
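The invalidation path mentioned here can be observed from plain SQL with an explicit prepared statement: DDL on a referenced table triggers a relcache invalidation, the saved plan is marked stale, and the next EXECUTE replans (or complains if the result shape changed). A small sketch with a made-up table name:

CREATE TABLE inval_demo(a int);
PREPARE p AS SELECT * FROM inval_demo;
EXECUTE p;                               -- plan is built and cached
ALTER TABLE inval_demo ADD COLUMN b int; -- sends an invalidation for inval_demo
EXECUTE p;                               -- the cached plan is rebuilt; because the
                                         -- result row type changed, this reports
                                         -- "cached plan must not change result type"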
Re: [HACKERS] Cached plans and statement generalization
On 26.04.2017 01:34, Andres Freund wrote:
Hi,
(FWIW, on this list we don't do top-quotes)
On 2017-04-25 22:21:22, Doug Doole wrote:
Plan invalidation was no different than for any SQL statement. DB2 keeps a list of the objects the statement depends on. If any of the objects changes in an incompatible way, the plan is invalidated and kicked out of the cache. I suspect what is more interesting is plan lookup. DB2 has something called the "compilation environment". This is a collection of everything that impacts how a statement is compiled (SQL path, optimization level, etc.). Plan lookup is done using both the statement text and the compilation environment. So, for example, if my path is DOUG, MYTEAM, SYSIBM and your path is ANDRES, MYTEAM, SYSIBM, we will have different compilation environments. If we both issue "SELECT * FROM T" we'll end up with different cache entries even if T in both of our statements resolves to MYTEAM.T. If I execute "SELECT * FROM T", change my SQL path and then execute "SELECT * FROM T" again, I have a new compilation environment, so the second invocation of the statement will create a new entry in the cache. The first entry is not kicked out - it will still be there for re-use if I change my SQL path back to my original value (modulo LRU for cache memory management, of course).
It's not always that simple, at least in postgres, unless you disregard search_path. Consider e.g. cases like:

CREATE SCHEMA a;
CREATE SCHEMA b;
CREATE TABLE a.foobar(somecol int);
SET search_path = 'b,a';
SELECT * FROM foobar;
CREATE TABLE b.foobar(anothercol int);
SELECT * FROM foobar; -- may not be the cached plan from before!

It sounds - my memory of DB2 is very faint, and I never used it much - like similar issues could arise in DB2 too?

There is the same problem with explicitly prepared statements, isn't there? Certainly, when prepared statements are used explicitly it is the programmer's responsibility to avoid such collisions, while with autoprepare it is hidden from the programmer. But there is a GUC variable controlling the autoprepare feature, and by default it is switched off. So if a programmer or DBA enables it, they should take the effects of such a decision into account. By the way, isn't it a bug in PostgreSQL that altering the search path does not invalidate cached plans? As I already mentioned, the same problem can be reproduced with explicitly prepared statements.

Greetings, Andres Freund
-- Konstantin Knizhnik Postgres Professional: http://www.postgrespro.com The Russian Postgres Company
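Konstantin's claim that explicit prepared statements behave the same way is easy to try; here is a sketch of the experiment (what the last EXECUTE does is exactly the question under discussion, so the comments only state what is certain: creating b.foobar sends no invalidation for a plan that depends only on a.foobar):

CREATE SCHEMA a;
CREATE SCHEMA b;
CREATE TABLE a.foobar(somecol int);
SET search_path = b, a;
PREPARE q AS SELECT * FROM foobar;
EXECUTE q;                            -- resolves to a.foobar
CREATE TABLE b.foobar(anothercol int);
EXECUTE q;                            -- b.foobar now shadows a.foobar in the path,
                                      -- but nothing invalidated the cached plan,
                                      -- which still references a.foobar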
Re: [HACKERS] Cached plans and statement generalization
On 26.04.2017 04:00, Tsunakawa, Takayuki wrote:
From: pgsql-hackers-ow...@postgresql.org [mailto:pgsql-hackers-ow...@postgresql.org] On Behalf Of Konstantin Knizhnik
Well, first of all I want to share the results I have already obtained: pgbench with default parameters, scale 10 and one connection: So autoprepare is as efficient as explicit prepare and can increase performance almost two times.
This sounds great. BTW, when are the autoprepared statements destroyed?
Right now they are destroyed only when an invalidation message is received (when the catalog is changed). Prepared statements are local to a backend and are located in the backend's memory. It is unlikely that there will be so many different queries that they cause memory overflow, but in theory such a situation is certainly possible.
Are you considering some upper limit on the number of prepared statements?
In this case we need some kind of LRU for maintaining the cache of autoprepared statements. I think it is a good idea to have such a limited cache: it can avoid the memory overflow problem. I will try to implement it.
Regards Takayuki Tsunakawa
-- Konstantin Knizhnik Postgres Professional: http://www.postgrespro.com The Russian Postgres Company
Re: [HACKERS] Cached plans and statement generalization
On 26.04.2017 10:49, Konstantin Knizhnik wrote:
On 26.04.2017 04:00, Tsunakawa, Takayuki wrote:
Are you considering some upper limit on the number of prepared statements?
In this case we need some kind of LRU for maintaining the cache of autoprepared statements. I think it is a good idea to have such a limited cache: it can avoid the memory overflow problem. I will try to implement it.

I attach a new patch which allows limiting the number of autoprepared statements (the autoprepare_limit GUC variable). I also did more measurements, now with several concurrent connections and read-only statements. Results of pgbench with 10 connections, scale 10 and read-only statements are below:

Protocol               TPS
extended               87k
prepared               209k
simple + autoprepare   206k

As you can see, autoprepare provides more than a twofold speed improvement.

I also tried to measure the overhead of parsing (needed to be able to substitute all literals, not only string literals). I just added an extra call of pg_parse_query. Speed is reduced to 181k TPS. So the overhead is noticeable, but it still leaves such an optimization useful. This is why I want to ask the question: is it better to implement a slower but safer and more universal solution?

-- Konstantin Knizhnik Postgres Professional: http://www.postgrespro.com The Russian Postgres Company

diff --git a/src/backend/tcop/postgres.c b/src/backend/tcop/postgres.c index f6be98b..0c9abfc 100644 --- a/src/backend/tcop/postgres.c +++ b/src/backend/tcop/postgres.c @@ -188,6 +188,7 @@ static bool IsTransactionStmtList(List *parseTrees); static void drop_unnamed_stmt(void); static void SigHupHandler(SIGNAL_ARGS); static void log_disconnections(int code, Datum arg); +static bool exec_cached_query(const char *query_string); /* @@ -916,6 +917,14 @@ exec_simple_query(const char *query_string) drop_unnamed_stmt(); /* + * Try to find cached plan + */ + if (autoprepare_threshold != 0 && exec_cached_query(query_string)) + { + return; + } + + /* * Switch to appropriate context for constructing parsetrees. */ oldcontext = MemoryContextSwitchTo(MessageContext); @@ -4500,3 +4509,606 @@ log_disconnections(int code, Datum arg) port->user_name, port->database_name, port->remote_host, port->remote_port[0] ? " port=" : "", port->remote_port))); } + +typedef struct { + char const* query; + dlist_node lru; + int64 exec_count; + CachedPlanSource* plan; + int n_params; + int16 format; + bool disable_autoprepare; +} plan_cache_entry; + +/* + * Replace string literals with parameters. We do not consider integer or real literals to avoid problems with + * negative number, user defined operators, ...
For example it is not easy to distinguish cases (-1), (1-1), (1-1)-1 + */ +static void generalize_statement(const char *query_string, char** gen_query, char** query_params, int* n_params) +{ + size_t query_len = strlen(query_string); + char const* src = query_string; + char* dst; + char* params; + unsigned char ch; + + *n_params = 0; + + *gen_query = (char*)palloc(query_len*2); /* assume that we have less than 1000 parameters, the worst case is replacing '' with $999 */ + *query_params = (char*)palloc(query_len + 1); + dst = *gen_query; + params = *query_params; + + while ((ch = *src++) != '\0') { + if (isspace(ch)) { + /* Replace sequence of whitespaces with just one space */ + while (*src && isspace(*(unsigned char*)src)) { +src += 1; + } + *dst++ = ' '; + } else if (ch == '\'') { + while (true) { +ch = *src++; +if (ch == '\'') { + if (*src != '\'') { + break; + } else { + /* escaped quote */ + *params++ = '\''; + src += 1; + } +} else { + *params++ = ch; +} + } + *params++ = '\0'; + dst += sprintf(dst, "$%d", ++*n_params); + } else { + *dst++ = ch; + } + } + Assert(dst <= *gen_query + query_len); + Assert(params <= *query_params + query_len*2); + *dst = '\0'; +} + +static uint32 plan_cache_hash_fn(const void *key, Size keysize) +{ + return string_hash(((plan_cache_entry*)key)->query, 0); +} + +static int plan_cache_match_fn(const void *key1, const void *key2, Size keysize) +{ + return strcmp(((plan_cache_entry*)key1)->query, ((plan_cache_entry*)key2)->query); +} + +static void* plan_cache_keycopy_fn(void *dest, const void *src, Size keysize) +{ + ((plan_cache_entry*)dest)->query = pstrdup(((plan_cache_entry*)src)->query); +return dest; +} + +#define PLAN_CACHE_SIZE 113 + +size_t nPlanCacheHits; +size_t nPlanCacheMisses; + +/* + * Try to generalize query, find cached plan for it and execute + */ +static bool exec_cached_query(const char *query_string) +{ + Co
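For completeness, this is roughly how the feature in the attached patch would be switched on from a session, using the GUC names that appear in the patches in this thread (autoprepare_threshold and autoprepare_limit); the exact semantics and the values shown are assumptions based on the surrounding discussion, not documented behaviour:

-- Assumed semantics: a non-zero autoprepare_threshold enables the cached-query
-- path in exec_simple_query(), and autoprepare_limit caps the per-backend cache
-- with LRU eviction.
SET autoprepare_threshold = 1;
SET autoprepare_limit = 1000;
-- After that, repeated simple-protocol queries that differ only in literals,
-- such as pgbench's
--   SELECT abalance FROM pgbench_accounts WHERE aid = 42;
-- become candidates for transparent generalization and plan caching.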
Re: [HACKERS] Cached plans and statement generalization
On 04/26/2017 08:08 PM, Doug Doole wrote:
A naive option would be to invalidate anything that depends on table or view *.FOOBAR. You could probably make it a bit smarter by also requiring that schema A appear in the path.
This has been rumbling around in my head. I wonder if you could solve this problem by registering dependencies on objects which don't yet exist. Consider:

CREATE TABLE C.T1(...);
CREATE TABLE C.T2(...);
SET search_path='A,B,C,D';
SELECT * FROM C.T1, T2;

For T1, you'd register a hard dependency on C.T1 and no virtual dependencies, since the table is explicitly qualified. For T2, you'd register a hard dependency on C.T2, since that is the table that was selected for the query. You'd also register virtual dependencies on A.T2 and B.T2, since if either of those tables (or views) is created you need to recompile the statement. (Note that no virtual dependency is created on D.T2, since that table would never be selected by the compiler.) The catch is that virtual dependencies would have to be recorded and searched as strings, not OIDs, since the objects don't exist. Virtual dependencies only have to be checked during CREATE processing though, so that might not be too bad. But this is getting off topic - I just wanted to capture the idea while it was rumbling around.

I think it will be enough to handle modification of the search path and invalidate the prepared statement cache in this case.

-- Konstantin Knizhnik Postgres Professional: http://www.postgrespro.com The Russian Postgres Company
[HACKERS] Bug in prepared statement cache invalidation?
Hi hackers,

I have found out that Postgres correctly invalidates prepared plans which directly depend on an altered relation, but doesn't invalidate plans having transitive (indirect) dependencies. Is it a bug or a feature?

postgres=# create table foo(x integer);
CREATE TABLE
postgres=# select * from foo;
 x
---
(0 rows)
postgres=# create function returnqueryf() returns setof foo as $$ begin return query select * from foo; end; $$ language plpgsql;
CREATE FUNCTION
postgres=# select * from returnqueryf();
 x
---
(0 rows)
postgres=# create function returnqueryff() returns setof foo as $$ begin return query select * from returnqueryf(); end; $$ language plpgsql;
CREATE FUNCTION
postgres=# select * from returnqueryff();
 x
---
(0 rows)
postgres=# alter table foo add column y integer;
ALTER TABLE
postgres=# select * from foo;
 x | y
---+---
(0 rows)
postgres=# select * from returnqueryf();
 x | y
---+---
(0 rows)
postgres=# select * from returnqueryff();
ERROR: structure of query does not match function result type
DETAIL: Number of returned columns (1) does not match expected column count (2).
CONTEXT: PL/pgSQL function returnqueryff() line 1 at RETURN QUERY

-- Konstantin Knizhnik Postgres Professional: http://www.postgrespro.com The Russian Postgres Company
Re: [HACKERS] Cached plans and statement generalization
On 26.04.2017 13:46, Pavel Stehule wrote:
I attach a new patch which allows limiting the number of autoprepared statements (the autoprepare_limit GUC variable). I also did more measurements, now with several concurrent connections and read-only statements. Results of pgbench with 10 connections, scale 10 and read-only statements are below:

Protocol               TPS
extended               87k
prepared               209k
simple + autoprepare   206k

As you can see, autoprepare provides more than a twofold speed improvement. I also tried to measure the overhead of parsing (needed to be able to substitute all literals, not only string literals). I just added an extra call of pg_parse_query. Speed is reduced to 181k TPS. So the overhead is noticeable, but it still leaves such an optimization useful. This is why I want to ask the question: is it better to implement a slower but safer and more universal solution?
An unsafe solution does not make any sense, and it is dangerous (80% of database users do not have the necessary knowledge). If somebody needs the maximum possible performance, then he can use explicit prepared statements.

I attached a new patch to this mail. I have completely reimplemented my original approach and now use parse tree transformation. The new pgbench (-S -c 10) results are the following:

Protocol               TPS
extended               87k
prepared               209k
simple + autoprepare   185k

So there is some slowdown compared with my original implementation and with explicitly prepared statements, but it still provides more than a twofold speed-up compared with unprepared queries. And it doesn't require changing existing applications. Since most real production applications work with the DBMS through some connection pool (pgbouncer, ...), I think such an optimization will be useful. Isn't it interesting that we can increase system throughput almost two times just by setting one parameter in the configuration file?

I also tried to enable autoprepare by default and run the regression tests. 7 tests fail for the following reasons:
1. Slightly different error reporting (for example, the error location is not always reported identically).
2. Differences in query behavior caused by changed local settings (Andres gave an example with search_path, and the date test fails because of a changed datestyle).
3. Problems with indirect dependencies (when a table is altered, only cached plans directly depending on this relation are invalidated, but not plans with indirect dependencies).
4. Domain checks not being performed for null values.
I do not think these issues can cause problems for real applications.

Also it is possible to limit the number of autoprepared statements using the autoprepare_limit parameter, avoiding possible backend memory overflow in case of a large number of unique queries sent by an application. LRU discipline is used to drop the least recently used plans.

Any comments and suggestions for future improvement of this patch are welcome.

-- Konstantin Knizhnik Postgres Professional: http://www.postgrespro.com The Russian Postgres Company

diff --git a/src/backend/nodes/nodeFuncs.c b/src/backend/nodes/nodeFuncs.c index cd39167..4fbc8b7 100644 --- a/src/backend/nodes/nodeFuncs.c +++ b/src/backend/nodes/nodeFuncs.c @@ -3610,6 +3610,454 @@ raw_expression_tree_walker(Node *node, } /* + * raw_expression_tree_mutator --- transform raw parse tree. + * + * This function is implementing slightly different approach for tree update than expression_tree_mutator(). + * Callback is given pointer to pointer to the current node and can update this field instead of returning reference to new node.
+ * It makes it possible to remember changes and easily revert them without extra traversal of the tree. + * + * This function do not need QTW_DONT_COPY_QUERY flag: it never implicitly copy tree nodes, doing in-place update. + * + * Like raw_expression_tree_walker, there is no special rule about query + * boundaries: we descend to everything that's possibly interesting. + * + * Currently, the node type coverage here extends only to DML statements + * (SELECT/INSERT/UPDATE/DELETE) and nodes that can appear in them, because + * this is used mainly during analysis of CTEs, and only DML statements can + * appear in CTEs. If some other node is visited, iteration is immediately stopped and true is returned. + */ +bool +raw_expression_tree_mutator(Node *node, + bool (*mutator) (), + void *context) +{ + ListCell *temp; + + /* + * The walker has already visited the current node, and so we need only + * recurse into any sub-nodes it has. + */ + if (node == NULL) + return false; + + /* Guard against stack overflow due to overly complex expressions */ + check_stack_depth(); + + switch (nodeTag(node)) + { + case T_SetToDefault: + case T_CurrentOfExpr: + case T_Integer: + case T_Float: + case T_String: + case T_BitString: + case T_Null: + case T_ParamRef: + case T_A