Re: [HACKERS] The plan for FDW-based sharding
On 02/27/2016 06:57 AM, Robert Haas wrote: On Sat, Feb 27, 2016 at 1:49 AM, Konstantin Knizhnik wrote: pg_tsdtm is based on another approach: it uses the system time as CSN and doesn't require an arbiter. In theory there is no limit on scalability. But differences in system time and the need for more rounds of communication have a negative impact on performance.

How do you prevent clock skew from causing serialization anomalies?

If a node receives a message from the "future", it just needs to wait until that future arrives. In practice we simply "adjust" the system time in this case, moving it forward (the system time is not actually changed; we just set a correction value which is added to the system time). This approach was discussed in the article: http://research.microsoft.com/en-us/people/samehe/clocksi.srds2013.pdf I hope the algorithm is explained there much better than I can do it here. A few notes:
1. I can not prove that our pg_tsdtm implements the approach described in this article absolutely correctly.
2. I did not try to formally prove that our implementation can not cause any serialization anomalies.
3. We just ran various synchronization tests (including the simplest debit-credit test, which breaks an old version of Postgres-XL) for several days and did not get any inconsistencies.
4. We have tested pg_tsdtm on a single node, on a blade cluster and on geographically distributed nodes (more than a thousand kilometers apart: one server in Vladivostok, another in Kaliningrad). Ping between these two servers takes about 100 msec. Performance of our benchmark drops about 100 times, but there were no inconsistencies.

Also, I want to point out once again that pg_tsdtm was not the primary idea of the proposed patch. There are well-known limitations of pg_tsdtm which we will try to address in the future. What we want is to include the XTM API in PostgreSQL so that we can continue our experiments with different transaction managers and implement multimaster on top of it (our first practical goal) without affecting the PostgreSQL core. If the XTM patch is included in 9.6, then we can propose our multimaster as a PostgreSQL extension and everybody can use it. Otherwise we have to maintain our own fork of Postgres, which significantly complicates using and maintaining it.

So there is no ideal solution which works well for all clusters. This is why it is not possible to develop just one GTM, propose it as a patch for review and then (hopefully) commit it into the Postgres core. IMHO it will never happen. And I do not think it is actually needed. What we need is a way to create our own transaction managers as Postgres extensions without affecting the core.

This seems rather defeatist. If the code is good and reliable, why should it not be committed to core?

Two reasons:
1. There is no ideal implementation of DTM which will fit all possible needs and be efficient for all clusters.
2. Even if such an implementation existed, the right way to integrate it is for Postgres to use some kind of TM API. I hope that everybody will agree that doing it this way:

    #ifdef PGXC
        /* In Postgres-XC, stop timestamp has to follow the timeline of GTM */
        xlrec.xact_time = xactStopTimestamp + GTMdeltaTimestamp;
    #else
        xlrec.xact_time = xactStopTimestamp;
    #endif

or this way:

    xlrec.xact_time = xactUseGTM ? xactStopTimestamp + GTMdeltaTimestamp : xactStopTimestamp;

is a very, very bad idea. In OO programming we would have an abstract TM interface and several implementations of this interface, for example MVCC_TM, 2PL_TM, Distributed_TM...
This is exactly what can be done with our XTM API. Since Postgres is implemented in C, not in C++, we have to emulate interfaces using structures with function pointers. And please notice that there is no need at all to include a DTM implementation in core, because it is not needed by everybody: it can easily be distributed as an extension. I hope that quite soon we can propose a multimaster extension which will provide functionality similar to MySQL Galera. And even right now we have integrated pg_dtm and pg_tsdtm with pg_shard and postgres_fdw, providing distributed consistency for them.

All arguments against XTM can be applied to any other extension API in Postgres, for example FDW. Is it general enough? There are many useful operations which are currently not handled by this API, for example performing aggregation and grouping on the foreign server side. But it is still a very useful and flexible mechanism, allowing many wonderful things to be implemented.

That is true. And everybody is entitled to an opinion on each new proposed hook, as to whether that hook is general or not. We have both accepted and rejected proposed hooks in the past.

-- 
Konstantin Knizhnik
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company
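To make the "structures with function pointers" idea concrete, here is a minimal, self-contained sketch in C. All names are hypothetical and chosen for illustration only; they are not the actual XTM definitions from the patch:

    /*
     * Hypothetical sketch of a pluggable transaction-manager interface in C,
     * emulated with a structure of function pointers (not the real XTM API).
     */
    #include <stdbool.h>
    #include <stdint.h>
    #include <stdio.h>

    typedef uint32_t TransactionId;

    typedef struct TransactionManager
    {
        TransactionId (*get_new_xid) (void);
        bool          (*xid_in_snapshot) (TransactionId xid);
        void          (*set_transaction_status) (TransactionId xid, int status);
    } TransactionManager;

    /* Built-in "local" implementation */
    static TransactionId next_xid = 100;

    static TransactionId local_get_new_xid(void) { return next_xid++; }
    static bool local_xid_in_snapshot(TransactionId xid) { return xid >= next_xid; }
    static void local_set_status(TransactionId xid, int status)
    {
        printf("xid %u -> status %d\n", (unsigned) xid, status);
    }

    static TransactionManager LocalTM = {
        local_get_new_xid,
        local_xid_in_snapshot,
        local_set_status
    };

    /* The core always calls through this pointer; an extension (a DTM)
     * can replace it at load time with its own implementation. */
    static TransactionManager *TM = &LocalTM;

    int main(void)
    {
        TransactionId xid = TM->get_new_xid();
        TM->set_transaction_status(xid, 1);   /* 1 = "committed" in this sketch */
        return 0;
    }

The point is simply that the core always calls through the pointer, so an extension loaded at startup can substitute its own (for example distributed) implementation without any #ifdef in the core code.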
Re: [HACKERS] Relation cache invalidation on replica
On 02/27/2016 04:16 AM, Simon Riggs wrote: On 27 February 2016 at 00:33, Simon Riggs <si...@2ndquadrant.com> wrote: On 27 February 2016 at 00:29, Andres Freund <and...@anarazel.de> wrote: On 2016-02-26 18:05:55 +0300, Konstantin Knizhnik wrote:

> The reason for the problem is that invalidation messages are not delivered to the replica after the end of a concurrent create index. Invalidation messages are included in the xlog as part of the transaction commit record. Concurrent index creation is split into three transactions, the last of which just performs an inplace update of the index tuple, marking it as valid and invalidating the cache. But since this transaction has no assigned XID, no commit record is created in the WAL and sent to the replicas. As a result, the replica doesn't receive these invalidation messages.

Ugh, that's a fairly ugly bug. Looking now.

If the above is true, then the proposed fix wouldn't work either. No point in sending a cache invalidation message on the standby if you haven't also written WAL, since the catalog re-read would just see the old row. heap_inplace_update() does write WAL, which blows away the starting premise. So I'm not seeing this as an extant bug in an open source version of PostgreSQL, in my current understanding.

The inplace update does create a WAL record, and this is why the index is marked as valid on the replica. But invalidation messages are sent only with the transaction commit record, and no such record is created in this case because there is no assigned XID. This is a real bug which was originally observed by one of our customers with several versions of Postgres (the last one they tried was 9.5.1). Then we reproduced it locally and determined the cause of the problem.

The repro scenario is very simple: you just need to create a large enough table (pgbench with scale factor 100 works well in my case) so that "create index concurrently" takes a substantial amount of time. If, while this statement is in progress, you execute on the replica some query which could use this index, the backend will cache the state of the relation without the index. Even after the index is actually constructed, it will never be used in this backend (but other backends on the replica will use it).

I am not sure about the best way of fixing the problem. I have not tested whether Andres's proposal:

    if (nrels != 0 || nmsgs != 0 || RelcacheInitFileInval)

actually fixes the problem. Assigning an XID in heap_inplace_update should definitely work. Is it better than forcing XID assignment in DefineIndex? I am not sure, because this problem seems to be related only to concurrent index creation (but maybe I am wrong). At least not all inplace updates cause catalog invalidation and so require XID assignment.

-- 
Simon Riggs http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

-- 
Konstantin Knizhnik
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company
Re: [HACKERS] The plan for FDW-based sharding
On 02/27/2016 06:54 AM, Robert Haas wrote: On Fri, Feb 26, 2016 at 10:56 PM, Konstantin Knizhnik wrote:

We do not have a formal proof that the proposed XTM is "general enough" to handle all possible transaction manager implementations. But there are two general ways of dealing with isolation: snapshot based and CSN based.

I don't believe that for a minute. For example, consider this article:

Well, I have to agree that I was not right to say that there are just two ways of providing distributed isolation. There is at least one more method: conservative locking. But it causes a huge number of extra network messages to be exchanged. Also, I mostly considered solutions compatible with the PostgreSQL MVCC model, and there are definitely other approaches, like preserving transaction commit order (as is done in Galera). Some of them can be implemented with XTM (preserving commit order), some can not (2PL). I have already admitted that XTM does not allow implementing ANY transaction manager. But we have considered several approaches to distributed transaction management described in articles about really working systems. Some of them are real production systems, such as SAP HANA; some are just prototypes, but working prototypes for which the authors have performed benchmarking and comparison with other approaches. The references you have mentioned are mostly theoretical descriptions of the problem. They are nice to know, but it is hard to build a concrete implementation based on these articles.

Briefly answering your other questions:

For example, consider a table with a million rows spread across any number of servers.

This is a sharding scenario; pg_tsdtm will work well in this case without requiring a lot of extra messages.

Now consider another workload where each transaction reads a row on one server, reads a row on another server, ...

It can be solved both with pg_dtm (central arbiter) and with pg_tsdtm (no arbiter). But actually your scenarios just prove once again that there can not be just one ideal distributed TM.

So maybe the goal for the GTM isn't to provide true serializability across the cluster but some lesser degree of transaction isolation. But then exactly which serialization anomalies are we trying to prevent, and why is it OK to prevent those and not others?

Absolutely agree. There is some theoretical discussion regarding CAP and different distributed levels of isolation. But in practice people want to solve their tasks. Most PostgreSQL users use the default isolation level, read committed, although there are a lot of "wonderful" anomalies with it. Serializable transactions in Oracle actually violate the fundamental serializability rule, and still Oracle is one of the most popular databases in the world... There was an isolation bug in Postgres-XL which didn't prevent commercial customers from using it... So I do not say that discussing all these theoretical questions, such as formally proven correctness of a distributed algorithm, is not needed. But I do not understand why it should prevent us from providing an extensible TM API. Yes, we can not do everything with it. But we can still implement many different approaches. I think that somehow proves that it is "general enough".

https://en.wikipedia.org/wiki/Global_serializability

I think the neutrality of that article is *very* debatable, but it certainly contradicts the idea that snapshots and CSNs are the only methods of achieving global serializability.
Or consider this lecture: http://hssl.cs.jhu.edu/~randal/416/lectures.old/ln5.2.pdf That's a great introduction to the problem we're trying to solve here, but again, snapshots are not mentioned, and CSNs certainly aren't mentioned. This write-up goes further, explaining three different methods for ensuring global serializability, none of which mention snapshots or CSNs: http://heaven.eee.metu.edu.tr/~vision/LectureNotes/EE442/Ee442ch7.html Actually, I think the second approach is basically a snapshot/CSN-type approach, but it doesn't use that terminology and the connection to what you are proposing is very unclear. I think you're approaching this problem from a viewpoint that is entirely too focused on the code that exists in PostgreSQL today. Lots of people have done lots of academic research on how to solve this problem, and you can't possibly say that CSNs and snapshots are the only solution to this problem unless you haven't read any of those papers. The articles above aren't exceptional in mentioning neither of the approaches that you are advocating - they are typical of the literature in this area. How can it be that the only solutions to this problem are ones that are totally different from the approaches that university professors who spend time doing research on concurrency have spent time exploring? I think we need to back up here and
Re: [HACKERS] The plan for FDW-based sharding
Neither pg_dtm nor pg_tsdtm supports the serializable isolation level. We implemented distributed snapshot isolation - the repeatable-read isolation level. We also do not support the read-committed isolation level now. We do not try to preserve transaction commit order on all nodes, but in principle it can be implemented using the XTM API: it allows redefining the function which actually sets the transaction status. pg_dtm performs 2PC here, and in principle it is possible to enforce commits in any particular order.

Concerning CSNs, maybe you are right and it is not correct to use this notion in this case. Actually there are many "CSNs" involved in a transaction commit. First of all, each transaction is assigned a local CSN (timestamp) when it is ready to commit. Then the CSNs of all nodes are exchanged and the maximal CSN is chosen. This maximum is written as the final transaction CSN and is used in visibility checks.

On 02/27/2016 01:48 AM, Kevin Grittner wrote: On Fri, Feb 26, 2016 at 2:19 PM, Konstantin Knizhnik wrote:

pg_tsdtm is based on another approach: it is using system time as CSN

Which brings up an interesting point: if we want logical replication to be free of serialization anomalies for those using serializable transactions, we need to support applying transactions in an order which may not be the same as commit order -- CSN (as such) would be the wrong thing. If serializable transaction 1 (T1) modifies a row and concurrent serializable transaction 2 (T2) reads the old version of the row, and modifies something based on that, T2 must be applied to a logical replica first even if T1 commits before it; otherwise the logical replica could see a state not consistent with business rules and which could not have been seen (due to SSI) on the source database. Any DTM API which does not support some mechanism to rearrange the order of transactions from commit order to some other order (based on, for example, read-write dependencies) is not complete. If it does support that, it gives us a way forward for presenting consistent data on logical replicas.

To avoid confusion, it might be best to reserve CSN for actual commit sequence numbers, or at least values which increase monotonically with each commit. The term of art for what I described above is "apparent order of execution", so maybe we want to use AOE or AOoE for the order we choose to use in a particular implementation. It doesn't seem to me to be outright inaccurate for cases where the system time on the various systems is used.

-- 
Kevin Grittner
EDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

-- 
Konstantin Knizhnik
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company
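For illustration, here is a minimal standalone sketch in C of the CSN-agreement step described above: each node proposes a local, timestamp-based CSN, and the maximum becomes the global commit CSN. This only illustrates the idea; it is not pg_tsdtm code, and the names are invented:

    /*
     * Sketch: each participant proposes a local timestamp-based CSN for the
     * committing transaction; the coordinator picks the maximum, and that
     * value is written as the final CSN on every node and used later in
     * visibility checks.  Illustrative only.
     */
    #include <stdint.h>
    #include <stdio.h>
    #include <time.h>

    typedef uint64_t csn_t;

    /* Local CSN proposal: current wall-clock time in microseconds. */
    static csn_t
    local_csn(void)
    {
        struct timespec ts;
        clock_gettime(CLOCK_REALTIME, &ts);
        return (csn_t) ts.tv_sec * 1000000 + ts.tv_nsec / 1000;
    }

    /* Coordinator: agree on the maximum of the proposed CSNs. */
    static csn_t
    agree_commit_csn(const csn_t *proposed, int nnodes)
    {
        csn_t max = 0;
        for (int i = 0; i < nnodes; i++)
            if (proposed[i] > max)
                max = proposed[i];
        return max;
    }

    int main(void)
    {
        csn_t proposals[3] = { local_csn(), local_csn() + 7, local_csn() };
        printf("global commit CSN = %llu\n",
               (unsigned long long) agree_commit_csn(proposals, 3));
        return 0;
    }

Roughly speaking, a tuple whose transaction got a commit CSN greater than the snapshot's CSN is then treated as invisible to that snapshot.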
Re: [HACKERS] The plan for FDW-based sharding
On 02/27/2016 11:38 PM, Kevin Grittner wrote: Is this an implementation of some particular formal technique? If so, do you have a reference to a paper on it? I get the sense that there has been a lot written about distributed transactions, and that it would be a mistake to ignore it, but I have not (yet) reviewed the literature for it. The reference to the article is at our WiKi pages explaining our DTM: https://wiki.postgresql.org/wiki/DTM http://research.microsoft.com/en-us/people/samehe/clocksi.srds2013.pdf -- Konstantin Knizhnik Postgres Professional: http://www.postgrespro.com The Russian Postgres Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] The plan for FDW-based sharding
Thank you very much for your comments.

On 01.03.2016 18:19, Robert Haas wrote: On Sat, Feb 27, 2016 at 2:29 AM, Konstantin Knizhnik wrote:

How do you prevent clock skew from causing serialization anomalies?

If a node receives a message from the "future", it just needs to wait until that future arrives. In practice we just "adjust" the system time in this case, moving it forward (the system time is not actually changed; we just set a correction value which is added to the system time). This approach was discussed in the article: http://research.microsoft.com/en-us/people/samehe/clocksi.srds2013.pdf I hope the algorithm is explained there much better than I can do it here.

Hmm, the approach in that article is very interesting, but it sounds different than what you are describing - they do not, AFAICT, have anything like a "correction value".

In the article they use the notion of a "wait":

    if T.SnapshotTime > GetClockTime()
        then wait until T.SnapshotTime

Originally we really did sleep here, but then we realized that instead of sleeping we can just adjust the local time. Sorry, I do not have a formal proof that it is equivalent, but... at least we have not encountered any inconsistencies after this change, and performance is improved. (See the sketch at the end of this message.)

There are well known limitations of this pg_tsdtm which we will try to address in the future.

How well known are those limitations? Are they documented somewhere? Or are they only well-known to you?

Sorry, well known to us. But they are described on the DTM wiki page. Right now pg_tsdtm does not support correct distributed deadlock detection (it does not build a global lock graph) and detects distributed deadlocks just based on timeouts. It doesn't support explicit locks, but "select for update" will work correctly.

What we want is to include the XTM API in PostgreSQL to be able to continue our experiments with different transaction managers and implement multimaster on top of it (our first practical goal) without affecting the PostgreSQL core. If the XTM patch is included in 9.6, then we can propose our multimaster as a PostgreSQL extension and everybody can use it. Otherwise we have to maintain our own fork of Postgres, which significantly complicates using and maintaining it.

Well I still think what I said before is valid. If the code is good, let it be a core submission. If it's not ready yet, submit it to core when it is. If it can't be made good, forget it.

I have nothing against committing the DTM code in core. But the best way to integrate it is still an OO-like approach, so we still need an API. Inserting if-s or switches in the existing code is IMHO an ugly idea. Also, it is not enough for the DTM code to be just "good": it should provide the expected functionality. But which functionality is expected? From my experience of developing different cluster solutions I can say that different customers have very different requirements. It is very hard, if possible at all, to satisfy them all. Right now I do not feel that I can predict all possible requirements for a DTM. This is why we want to provide an API, propose some implementations of this API, receive feedback and get a better understanding of which functionality is actually needed by customers.

This seems rather defeatist. If the code is good and reliable, why should it not be committed to core?

Two reasons: 1. There is no ideal implementation of DTM which will fit all possible needs and be efficient for all clusters.

Hmm, what is the reasoning behind that statement?
I mean, it is certainly true that there are some places where we have decided that one-size-fits-all is not the right approach. Indexing, for example. But there are many other places where we have not chosen to make things pluggable, and I don't think it should be taken for granted that pluggability is always an advantage. I fear that building a DTM that is fully reliable and also well-performing is going to be really hard, and I think it would be far better to have one such DTM that is 100% reliable than two or more implementations each of which are 99% reliable.

The question is not about its reliability, but mostly about its functionality and flexibility.

2. Even if such an implementation existed, the right way to integrate it is for Postgres to use some kind of TM API.

Sure, APIs are generally good, but that doesn't mean *this* API is good.

Well, I do not want to just say "better than nothing", but I find this API to be a reasonable compromise between flexibility and minimization of changes in the PostgreSQL core. If you have suggestions on how to improve it, I will be glad to receive them.

I hope that everybody will agree that doing it in this way:

    #ifdef PGXC
        /* In Postgres-XC, stop timestamp has to follow the timeline of GTM */
        xlrec.xact_time = xactStopTimestamp + GTMdeltaTimestamp;
    #else
        xlrec.xact_time = xactStopTimestamp;
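Here is the sketch referenced above: a minimal, self-contained illustration in C of the two strategies for a snapshot timestamp that lies in the local future - literally waiting versus advancing a correction offset. It only illustrates the idea and is not the pg_tsdtm code:

    /*
     * Two ways to handle a snapshot timestamp that is "in the future"
     * relative to the local clock: sleep until it arrives, or move a logical
     * clock forward by a correction offset.  Illustrative only.
     */
    #include <inttypes.h>
    #include <stdint.h>
    #include <stdio.h>
    #include <time.h>
    #include <unistd.h>

    static int64_t clock_correction_us = 0;   /* forward-only adjustment */

    static int64_t
    current_time_us(void)
    {
        struct timespec ts;
        clock_gettime(CLOCK_REALTIME, &ts);
        return (int64_t) ts.tv_sec * 1000000 + ts.tv_nsec / 1000 + clock_correction_us;
    }

    /* Option 1: literally wait until the remote snapshot time has arrived. */
    static void
    wait_until(int64_t snapshot_time_us)
    {
        int64_t now = current_time_us();
        if (snapshot_time_us > now)
            usleep((useconds_t) (snapshot_time_us - now));
    }

    /* Option 2: do not sleep; advance the logical clock so that local
     * timestamps are never behind what we have already observed. */
    static void
    advance_clock_to(int64_t snapshot_time_us)
    {
        int64_t now = current_time_us();
        if (snapshot_time_us > now)
            clock_correction_us += snapshot_time_us - now;
    }

    int main(void)
    {
        int64_t future = current_time_us() + 2000;   /* 2 ms ahead */
        wait_until(future);                          /* strategy from the paper */
        advance_clock_to(future + 2000);             /* strategy used instead */
        printf("corrected time: %" PRId64 " us\n", current_time_us());
        return 0;
    }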
Re: [HACKERS] The plan for FDW-based sharding
On 01.03.2016 19:03, Robert Haas wrote: On Tue, Mar 1, 2016 at 10:37 AM, Bruce Momjian wrote: On Tue, Mar 1, 2016 at 10:19:45AM -0500, Robert Haas wrote:

Two reasons: 1. There is no ideal implementation of DTM which will fit all possible needs and be efficient for all clusters.

Hmm, what is the reasoning behind that statement? I mean, it is certainly true that there are some places where we have decided that one-size-fits-all is not the right approach. Indexing, for example.

Uh, is that even true of indexing? While the plug-in nature of indexing allows for easier development and testing, does anyone create plug-in indexing that isn't shipped by us? I thought WAL support was something that prevented external indexing solutions from working.

True. There is an API, though, and having pluggable WAL support seems desirable too. At the same time, I don't think we know of anyone maintaining a non-core index AM ... and there are probably good reasons for that. We end up revising the index AM API pretty regularly every time somebody wants to do something new, so it's not really a stable API that extensions can just tap into. I suspect that a transaction manager API would end up similarly situated.

IMHO a non-stable API is better than no API at all, simply because it makes it possible to implement features in a modular way. And refactoring an API is not such a difficult thing...

-- 
Konstantin Knizhnik
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company
Re: [HACKERS] The plan for FDW-based sharding
On 03/01/2016 09:19 PM, Petr Jelinek wrote:

Since this thread heavily discusses the XTM, I have a question about the XTM as proposed, because one thing is very unclear to me - what happens when the user changes the XTM plugin on the server? I didn't see any xid handover API, which makes me wonder if changes of a plugin (or for example failure to load a previously used plugin due to admin error) will send the server into a situation similar to xid wraparound.

The transaction manager is a very "intimate" part of a DBMS, and certainly bugs and problems in a custom TM implementation can break the server. So if you provide a custom TM implementation, you have to take full responsibility for system integrity.

The XTM API itself doesn't enforce any XID handling policy. Since we do not want to change the tuple header format, the XID is still a 32-bit integer. In the case of pg_dtm, global transactions on all nodes are assigned the same XID by the arbiter, and the arbiter handles XID wraparound. In pg_tsdtm each node maintains its own XIDs; pg_tsdtm doesn't actually change the way Postgres assigns XIDs, so wraparound is handled in the standard way. Instead of assigning its own global XIDs, pg_tsdtm provides a mapping between local XIDs and global CSNs. The visibility checking rules look at CSNs, not at XIDs.

In both cases, if the system is restarted for some reason and the DTM plugin fails to load, you can still access the database locally. No data can be lost.

-- 
Konstantin Knizhnik
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company
Re: [HACKERS] PROPOSAL: Fast temporary tables
As far as I know, we are trying to kill two birds with one stone:
1. Reduce the overhead of accessing temporary tables.
2. Make it possible to create temporary tables on a replica.

Replicas with hot standby are widely used for running read-only OLAP queries. But such queries usually store intermediate results in temporary tables, and unfortunately creating a temporary table on a read-only replica is impossible now. So some customers resort to the following trick: they create a pool of file FDWs on the master and then use them on the replicas. But IMHO it is an ugly and inefficient hack.

Ideally we should be able to create temporary tables on a replica without affecting the system catalog. But there are a lot of problems: where they should be stored, how to assign XIDs to the tuples inserted into a temporary table, ... Unfortunately, it looks like there is no simple solution to the problem. The 100% solution is multimaster (which we are currently developing), but that is a completely different story...

On 03/01/2016 10:17 PM, Jim Nasby wrote: On 3/1/16 10:05 AM, Atri Sharma wrote:

Fair point, that means inventing a whole new OID generation structure..

Generation is just the tip of the iceberg. You still need the equivalent to foreign keys (ie: pg_depend). While you would never have a permanent object depend on a temp object, the reverse certainly needs to be supported.

If I were attempting to solve this at a SQL level, I'd be thinking about using table inheritance such that the permanent objects are stored in a permanent parent. New backends would create UNLOGGED children off of that parent. There would be a pid column that was always NULL in the parent, but populated in children. That means children could use their own local form of an OID. When a backend terminates you'd just truncate all its tables. Actually translating that into relcache and everything else would be a serious amount of work.

-- 
Konstantin Knizhnik
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company
Re: [HACKERS] The plan for FDW-based sharding
On 01.03.2016 22:02, Bruce Momjian wrote: On Tue, Mar 1, 2016 at 07:56:58PM +0100, Petr Jelinek wrote:

Note that I am not saying that other discussed approaches are any better, I am saying that we should know approximately what we actually want and not just beat FDWs with a hammer and hope sharding will eventually emerge and call that the plan.

I will say it again --- FDWs are the only sharding method I can think of that has a chance of being accepted into Postgres core. It is a plan, and if it fails, it fails. If it succeeds, that's good. What more do you want me to say? I know of no other way to answer the questions you asked above.

I do not understand why it should fail. The FDW approach may not be flexible enough for building optimal distributed query execution plans for complex OLAP queries. But for simple queries it should work fine, and simple queries correspond to OLTP and simple OLAP. For OLTP we definitely need a transaction manager to provide global consistency, and we actually have a prototype integrating postgres_fdw with our pg_dtm and pg_tsdtm transaction managers. The results are IMHO quite promising (see the attached diagram).

-- 
Konstantin Knizhnik
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company

DTM-pgconf.pdf
Description: Adobe PDF document
Re: [HACKERS] The plan for FDW-based sharding
scenarios. It is especially true for DTM, because the requirements of various cluster solutions are very different. And the most convenient way of doing it is to ship a DTM as an extension, not as some fork of Postgres; that will greatly simplify using it.

Now, on connection pooling, I am similarly not opposed to having some well-designed hooks, but I also think in the long run it would be better for some improvements in this area to be part of core. None of that means I would support any particular hook proposal, of course.

-- 
Konstantin Knizhnik
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company
Re: [HACKERS] Optimizer questions
On 03/08/2016 07:01 AM, Tom Lane wrote: Konstantin Knizhnik writes: Attached please find improved version of the optimizer patch for LIMIT clause. This patch isn't anywhere close to working after 3fc6e2d7f5b652b4. (TBH, the reason I was negative about this upthread is that I had that one in the oven and knew it would conflict spectacularly.) I encourage you to think about how an optimization of this sort could be made to work in a non-kluge fashion in the new code structure. I've not spent a lot of time on this, but I think maybe what would make sense is to consider both the case where function calculations are postponed to after ORDER BY and the case where they aren't, and generate Paths for both. Neither approach is a slam-dunk win. For example, suppose that one of the tlist columns is md5(wide_column) --- it will likely not be preferable to pass the wide column data through the sort step rather than reducing it to a hash first. This would require some work in grouping_planner to track two possible pathtargets, and work in create_ordered_paths to generate paths for both possibilities. A possible objection is that this would add planning work even when no real benefit is possible; so maybe we should only consider the new way if the tlist has significant eval cost? Not sure about that. There is also something to be said for the idea that we should try to guarantee consistent semantics when the tlist contains volatile functions. For now, I've set this commitfest entry to Waiting on Author. There's still time to consider a rewrite in this 'fest, if you can get it done in a week or two. regards, tom lane Attached please find rebased patch. Unfortunately 3fc6e2d7f5b652b4 still doesn't fix the problem with "lazy" evaluation of target list. This is why my patch is still useful. But frankly speaking I am not sure that it is best way of fixing this problem, because it takes in account only one case: sort+limit. May be the same optimization can be useful for other queries. -- Konstantin Knizhnik Postgres Professional: http://www.postgrespro.com The Russian Postgres Company diff --git a/src/backend/optimizer/plan/planner.c b/src/backend/optimizer/plan/planner.c index 5fc8e5b..709d1ad 100644 --- a/src/backend/optimizer/plan/planner.c +++ b/src/backend/optimizer/plan/planner.c @@ -126,8 +126,9 @@ static RelOptInfo *create_ordered_paths(PlannerInfo *root, RelOptInfo *input_rel, double limit_tuples); static PathTarget *make_scanjoin_target(PlannerInfo *root, List *tlist, - AttrNumber **groupColIdx); + AttrNumber **groupColIdx, bool* splitted_projection); static int get_grouping_column_index(Query *parse, TargetEntry *tle); +static int get_sort_column_index(Query *parse, TargetEntry *tle); static List *postprocess_setop_tlist(List *new_tlist, List *orig_tlist); static List *select_active_windows(PlannerInfo *root, WindowFuncLists *wflists); static List *make_windowInputTargetList(PlannerInfo *root, @@ -1381,6 +1382,7 @@ grouping_planner(PlannerInfo *root, bool inheritance_update, RelOptInfo *current_rel; RelOptInfo *final_rel; ListCell *lc; + bool splitted_projection = false; /* Tweak caller-supplied tuple_fraction if have LIMIT/OFFSET */ if (parse->limitCount || parse->limitOffset) @@ -1657,7 +1659,7 @@ grouping_planner(PlannerInfo *root, bool inheritance_update, * that were obtained within query_planner(). */ sub_target = make_scanjoin_target(root, tlist, - &groupColIdx); + &groupColIdx, &splitted_projection); /* * Forcibly apply that tlist to all the Paths for the scan/join rel. 
@@ -1801,6 +1803,13 @@ grouping_planner(PlannerInfo *root, bool inheritance_update, { Path *path = (Path *) lfirst(lc); + if (splitted_projection) + { + path = apply_projection_to_path(root, current_rel, + path, create_pathtarget(root, tlist)); + } + + /* * If there is a FOR [KEY] UPDATE/SHARE clause, add the LockRows node. * (Note: we intentionally test parse->rowMarks not root->rowMarks @@ -3775,15 +3784,17 @@ create_ordered_paths(PlannerInfo *root, static PathTarget * make_scanjoin_target(PlannerInfo *root, List *tlist, - AttrNumber **groupColIdx) + AttrNumber **groupColIdx, + bool* splitted_projection) { Query *parse = root->parse; - List *sub_tlist; - List *non_group_cols; + List *sub_tlist = NIL; + List *non_group_cols = NIL; List *non_group_vars; int numCols; *groupColIdx = NULL; + *splitted_projection = false; /* * If we're not grouping or aggregating or windowing, there's nothing to @@ -3791,14 +3802,66 @@ make_scanjoin_target(PlannerInfo *root, */ if (!parse->hasAggs && !parse->groupClause && !parse->groupingSets && !root->hasHavingQual && !parse-
Re: [HACKERS] Optimizer questions
On 09.03.2016 09:15, Tom Lane wrote: I wrote:

BTW, there's some additional refactoring I had had in mind to do in grouping_planner to make its handling of the targetlist a bit more organized; in particular, I'd like to see it using PathTarget representation more consistently throughout the post-scan-join steps.

See 51c0f63e4d76a86b44e87876a6addcfffb01ec28 --- I think this gets things to where we could plug in additional levels of targets without too much complication.

regards, tom lane

So, if I understand you correctly, there are two major concerns:

1. Volatile functions. I wonder if it is really required to re-evaluate a volatile function for each record even if a LIMIT clause is present. The documentation says: "A query using a volatile function will re-evaluate the function at every row where its value is needed." So if we are using sort with limit and the value of the function is not used in the sort, then it is correct to say that the value of this function is not needed, so there is no need to re-evaluate it, isn't it?

2. Narrowing functions, like md5. Here I do not have any good idea how to support them. It looks like the cost of the SORT should depend on the tuple width; only in that case can the optimizer determine whether it is more efficient to evaluate the function earlier or to postpone its execution.

I think that the best approach is to generate two different paths: the original one, where projection is always done before the sort, and another one with postponed projection of non-trivial columns. Then we compare the costs of the two paths and choose the better one. Unfortunately, I do not understand yet how to implement this with the existing grouping_planner. Do you think that it is possible?

An alternative approach is to do something like my proposed patch, but take into account the cost of function execution and check for the presence of volatile/narrowing functions. This approach provides better flexibility, because we can choose the subset of columns not used in the sort whose evaluation should be postponed. But here we once again make a local decision while constructing the path instead of comparing the costs of full paths.

-- 
Konstantin Knizhnik
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company
Re: [HACKERS] Optimizer questions
On Mar 10, 2016, at 1:56 AM, Tom Lane wrote:
> Konstantin Knizhnik writes:
>> I think that the best approach is to generate two different paths: original one, when projection is always done before sort and another one with postponed projection of non-trivial columns. Then we compare costs of two paths and choose the best one. Unfortunately, I do not understand now how to implement it with existed grouping_planner. Do you think that it is possible?
>
> After fooling with this awhile, I don't think it's actually necessary to do that. See attached proof-of-concept patch.

Oh, you did all my work :)

But right now the cost estimation rule makes it impossible to apply this optimization to simple expressions like this:

postgres=# create table a (b integer);
CREATE TABLE
postgres=# insert into a values (generate_series(1, 10));
INSERT 0 10
postgres=# select b+b from a order by b limit 1;
NOTICE:  int4pl
NOTICE:  int4pl
NOTICE:  int4pl
NOTICE:  int4pl
NOTICE:  int4pl
NOTICE:  int4pl
NOTICE:  int4pl
NOTICE:  int4pl
NOTICE:  int4pl
NOTICE:  int4pl
 ?column?
----------
        2
(1 row)

postgres=# create or replace function twice(x integer) returns integer as $$ begin raise notice 'exec function' ; return x + x ; end $$ language plpgsql;
CREATE FUNCTION
postgres=# select twice(b) from a order by b limit 1;
NOTICE:  exec function
NOTICE:  int4pl
 twice
-------
     2
(1 row)

I wonder whether there is any advantage in earlier evaluation of such simple expressions if they are not needed for the sort? It seems to be only true for narrowing functions, like md5... But that is quite an exotic case, isn't it? Maybe it is reasonable to be more optimistic in estimating the cost of postponed expression evaluation?

Also, I do not completely understand your concern about window functions. Is there any example of a query with window or aggregate functions where this optimization (postponing expression evaluation) can be applied? It would also be interesting to know whether there are other cases (besides SORT+LIMIT) where delaying projection leads to a more efficient plan.

> Although this patch gets through our regression tests, that's only because it's conservative about deciding to postpone evaluation; if it tried to postpone evaluation in a query with window functions, it would fail miserably, because pull_var_clause doesn't know about window functions. I think that that's a clear oversight and we should extend it to offer the same sorts of behaviors as it does for Aggrefs. But that would be slightly invasive, there being a dozen or so callers; so I didn't bother to do it yet pending comments on this patch.
>
> I think it's probably also broken for SRFs in the tlist; we need to work out what semantics we want for those. If we postpone any SRF to after the Sort, we can no longer assume that a query LIMIT enables use of bounded sort (because the SRF might repeatedly return zero rows). I don't have a huge problem with that, but I think now would be a good time to nail down some semantics.
>
> Some other work that would be useful would be to refactor so that the cost_qual_eval operations aren't so redundant. But that's just a small time savings not a question of functionality.
>
> And we'd have to document that this changes the behavior for volatile functions. For the better, I think: this will mean that you get consistent results no matter whether the query is implemented by indexscan or seqscan-and-sort, which has never been true before. But it is a change.
> > Do people approve of this sort of change in general, or this patch > approach in particular? Want to bikeshed the specific > when-to-postpone-eval policies implemented here? Other comments? > > regards, tom lane > > diff --git a/src/backend/optimizer/plan/planner.c > b/src/backend/optimizer/plan/planner.c > index 8937e71..b15fca1 100644 > *** a/src/backend/optimizer/plan/planner.c > --- b/src/backend/optimizer/plan/planner.c > *** static RelOptInfo *create_distinct_paths > *** 126,131 > --- 126,132 > RelOptInfo *input_rel); > static RelOptInfo *create_ordered_paths(PlannerInfo *root, >RelOptInfo *input_rel, > + PathTarget *target, >double limit_tuples); > static PathTarget *make_group_input_target(PlannerInfo *root, List *tlist); > static List *postprocess_setop_tlist(List *new_tlist, List *orig_tlist); > *** static PathTarget *make_window_input_tar > *** 13
Re: [HACKERS] eXtensible Transaction Manager API (v2)
ral arbiter, so we have to introduce an "in-doubt" state of a transaction, when it is not known whether the transaction is committed or aborted, and any other transaction accessing tuples updated by this transaction has to wait while its status is "in-doubt". The main challenge of pg_tsdtm is to make this period as short as possible... But these are details of a particular implementation which IMHO have no relation to the API itself.

3. Uh, how can you hook GetNewTransactionId but not ReadNewTransactionId?

Uh-uh-uh :) ReadNewTransactionId just reads the value of ShmemVariableCache->nextXid, but unfortunately that is not the only place where nextXid is used - there are about a hundred occurrences of nextXid in the Postgres core. This is why we made the decision that GetNewTransactionId should actually update ShmemVariableCache->nextXid, so that there is no need to rewrite all this code. Sorry, but IMHO this is a problem of the Postgres design and not of XTM ;) We just want to find some compromise which allows XTM to be flexible enough but minimizes changes in the core code.

4. There seems to be an intention to encapsulate snapshots, but surely wrapping hooks around GetSnapshotData and XidInMVCCSnapshot is not nearly enough for that. Look at all the knowledge snapmgr.c has about snapshot representation, for example. And is a function like GetOldestXmin even meaningful with a different notion of what snapshots are? (For that matter, is TransactionId == uint32 still tenable for any other notion of snapshots?)

The XTM encapsulation of snapshots allows us to implement pg_dtm. It does almost the same as the Postgres-XL GTM, but without a huge amount of #ifdefs. The representation of XIDs is yet another compromise point: we do not want to change the tuple header format, so an XID is still 32 bits and has the same meaning as in PostgreSQL. If a custom TM implementation wants to use some other identifiers of transactions, like CSNs in pg_tsdtm, it has to provide a mapping between them and XIDs.

5. BTW, why would you hook at XidInMVCCSnapshot rather than making use of the existing capability to have a custom SnapshotSatisfiesFunc snapshot checker function?

The HeapTupleSatisfies* routines in utils/time/tqual.c implement a lot of logic for handling different kinds of snapshots, checking/setting hint bits in tuples, caching, ... We do not want to replace or copy all this code in a DTM implementation. And XidInMVCCSnapshot is the common function finally used by most HeapTupleSatisfies* functions once all the other checks have passed. So it is really the most convenient place to plug in custom visibility checking rules. And as far as I remember, a similar approach was used in Postgres-XL.

IMO this is not committable as-is, and I don't think that it's something that will become committable during this 'fest. I think we'd be well advised to boot it to the 2016-09 CF and focus our efforts on other stuff that has a better chance of getting finished this month.

regards, tom lane

-- 
Konstantin Knizhnik
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company
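As an illustration of the XID-to-CSN mapping idea mentioned above, here is a small, self-contained sketch in C of what a CSN-based visibility check looks like conceptually. The names and the toy lookup table are hypothetical; this is not the pg_tsdtm code:

    /*
     * Conceptual sketch of CSN-based visibility: map a 32-bit XID to the CSN
     * assigned at commit and compare it with the snapshot's CSN, instead of
     * searching an array of in-progress XIDs.  Illustrative only.
     */
    #include <stdbool.h>
    #include <stdint.h>
    #include <stdio.h>

    typedef uint32_t TransactionId;
    typedef uint64_t csn_t;

    #define INVALID_CSN ((csn_t) 0)        /* transaction not committed yet */
    #define MAX_XID     1024               /* toy table size for the sketch */

    static csn_t commit_csn[MAX_XID];      /* hypothetical XID -> CSN map */

    typedef struct
    {
        csn_t snapshot_csn;                /* "as of" point of the snapshot */
    } CsnSnapshot;

    /* True if xid must be treated as still in progress for this snapshot. */
    static bool
    xid_in_csn_snapshot(TransactionId xid, const CsnSnapshot *snap)
    {
        csn_t csn = commit_csn[xid % MAX_XID];

        /* Invisible if not committed, or committed after the snapshot. */
        return csn == INVALID_CSN || csn > snap->snapshot_csn;
    }

    int main(void)
    {
        CsnSnapshot snap = { 100 };

        commit_csn[42] = 90;               /* committed before the snapshot */
        commit_csn[43] = 110;              /* committed after the snapshot  */

        printf("xid 42 in progress? %d\n", xid_in_csn_snapshot(42, &snap));  /* 0 */
        printf("xid 43 in progress? %d\n", xid_in_csn_snapshot(43, &snap));  /* 1 */
        return 0;
    }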
Re: [HACKERS] Batch update of indexes
Hi David, Rebased patch is attached. On 14.03.2016 15:09, David Steele wrote: Hi Konstantin, On 2/3/16 11:47 AM, Konstantin Knizhnik wrote: Attached please find patch for "ALTER INDEX ... WHERE ..." clause. It is now able to handle all three possible situations: 1. Making index partial (add WHERE condition to the ordinary index) 2. Extend partial index range (less restricted index predicate) 3. Arbitrary change of partial index predicate In case 2) new records are added to the index. In other two cases index is completely reconstructed. This patch includes src/bin/insbench utility for testing insert performance. It can be easily excluded from the patch to reduce it size. Also it is better to apply this patch together with "index-only scans with partial indexes" patch: This patch no longer applies on master: $ git apply ../other/alter-index.patch ../other/alter-index.patch:27: trailing whitespace. static void ../other/alter-index.patch:62: indent with spaces. SPIPlanPtr plan; ../other/alter-index.patch:63: indent with spaces. Portal portal; ../other/alter-index.patch:90: trailing whitespace. ../other/alter-index.patch:99: trailing whitespace. error: patch failed: src/test/regress/expected/aggregates.out:831 error: src/test/regress/expected/aggregates.out: patch does not apply Please provide a new patch for review. Meanwhile I am marking this "waiting on author". Thanks, -- Konstantin Knizhnik Postgres Professional: http://www.postgrespro.com The Russian Postgres Company diff --git a/doc/src/sgml/ref/alter_index.sgml b/doc/src/sgml/ref/alter_index.sgml index ee3e3de..ab042ed 100644 --- a/doc/src/sgml/ref/alter_index.sgml +++ b/doc/src/sgml/ref/alter_index.sgml @@ -27,6 +27,7 @@ ALTER INDEX [ IF EXISTS ] name SET ALTER INDEX [ IF EXISTS ] name RESET ( storage_parameter [, ... ] ) ALTER INDEX ALL IN TABLESPACE name [ OWNED BY role_name [, ... ] ] SET TABLESPACE new_tablespace [ NOWAIT ] +ALTER INDEX name WHERE predicate @@ -109,6 +110,18 @@ ALTER INDEX ALL IN TABLESPACE name + +WHERE predicate + + + Add or change predicate of partial index. Extending partial index predicate allows to implement batch update of index and so + increase insert speed. New records (not matching index predicate) can be added to the table at maximal speed without affecting indexes. + Later, in background, indexes can be refreshed using ALTER INDEX ... WHERE ... clause. + See for more discussion. 
+ + + + diff --git a/src/backend/commands/indexcmds.c b/src/backend/commands/indexcmds.c index 13b04e6..a63de2a 100644 --- a/src/backend/commands/indexcmds.c +++ b/src/backend/commands/indexcmds.c @@ -34,6 +34,7 @@ #include "commands/tablespace.h" #include "mb/pg_wchar.h" #include "miscadmin.h" +#include "funcapi.h" #include "nodes/nodeFuncs.h" #include "optimizer/clauses.h" #include "optimizer/planner.h" @@ -52,6 +53,9 @@ #include "utils/snapmgr.h" #include "utils/syscache.h" #include "utils/tqual.h" +#include "utils/ruleutils.h" +#include "executor/executor.h" +#include "executor/spi.h" /* non-export function prototypes */ @@ -280,6 +284,160 @@ CheckIndexCompatible(Oid oldId, return ret; } +static void +UpdateIndex(Oid indexRelationId, Node* whereClause) +{ + Datum values[Natts_pg_index]; + bool isnull[Natts_pg_index]; + HeapTuple oldTuple; + HeapTuple newTuple; + Relation pg_index; + + pg_index = heap_open(IndexRelationId, RowExclusiveLock); + oldTuple = SearchSysCacheCopy1(INDEXRELID, ObjectIdGetDatum(indexRelationId)); + if (!HeapTupleIsValid(oldTuple)) + elog(ERROR, "cache lookup failed for index %u", indexRelationId); + + heap_deform_tuple(oldTuple, RelationGetDescr(pg_index), values, isnull); + values[Anum_pg_index_indpred - 1] = CStringGetTextDatum(nodeToString(whereClause)); + isnull[Anum_pg_index_indpred - 1] = false; + newTuple = heap_form_tuple(RelationGetDescr(pg_index), values, isnull); + simple_heap_update(pg_index, &oldTuple->t_self, newTuple); + CatalogUpdateIndexes(pg_index, newTuple); + heap_freetuple(newTuple); + heap_freetuple(oldTuple); + heap_close(pg_index, NoLock); +} + +void +AlterIndex(Oid indexRelationId, IndexStmt *stmt) +{ + char* select; + Oid heapRelationId; + IndexUniqueCheck checkUnique; + Datum values[INDEX_MAX_KEYS]; + bool isnull[INDEX_MAX_KEYS]; + Relation heapRelation; + Relation indexRelation; +SPIPlanPtr plan; +Portal portal; + HeapTuple tuple; + TupleTableSlot *slot; + ItemPointer tupleid; + IndexInfo *indexInfo; + EState *estate; + Oid namespaceId; + List* deparseCtx; + char* oldIndexPredicate; + char* newIndexPredicate; + char* relationName; + + Assert(stmt->whereClause); + CheckPredicate((Expr *) stmt->whereClause); + + /* Open and lock the
[HACKERS] Applying logical replication changes by more than one process
Hi,

I am trying to use the logical replication mechanism in the implementation of PostgreSQL multimaster and have run into one conceptual problem. Originally logical replication was intended to support asynchronous replication, in which case applying changes in a single process should not be a bottleneck. But if we are using a distributed transaction manager to provide global consistency, then applying transactions with one worker leads to very bad performance and, what is worse, causes unintended serialization of transactions, which is not taken into account by the distributed deadlock detection algorithm and so can cause undetected deadlocks. So I have implemented a pool of background workers which can apply transactions concurrently. It works and shows acceptable performance.

But now I am thinking about HA and about tracking origin LSNs, which are needed to correctly specify the slot position in case of recovery. And there is a problem: as far as I understand, to correctly record the origin LSN in the WAL and advance the slot position it is necessary to set up the session using replorigin_session_setup. That is not so convenient when using a pool of background workers, because we have to set up the session for each commit. But the main problem is that each slot's session can be associated with only one process:

    else if (curstate->acquired_by != 0)
    {
        ereport(ERROR,
                (errcode(ERRCODE_OBJECT_IN_USE),
                 errmsg("replication identifier %d is already active for PID %d",
                        curstate->roident, curstate->acquired_by)));
    }

which once again means that there can be only one process applying changes.

To provide a correct state of the replication node it is necessary to ensure that each logical replication record is replayed exactly once: we should not lose a change or try to apply it twice. So the operation of recording the origin LSN position in the WAL and adjusting the slot should be atomic. And during recovery we should restore the slot's current position based on the origin values extracted from the WAL. I wonder whether this can be done using the current logical replication mechanism when the changes of each slot are applied by more than one process? Or is the only alternative to write/read origin LSNs in the WAL myself, for example using custom WAL records?

Thanks in advance!

-- 
Konstantin Knizhnik
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company
Re: [HACKERS] Applying logical replication changes by more than one process
On 21.03.2016 15:10, Petr Jelinek wrote:

Hi,

On 19/03/16 11:46, Konstantin Knizhnik wrote:

[...] But the main problem is that each slot's session can be associated with only one process:

    else if (curstate->acquired_by != 0)
    {
        ereport(ERROR,
                (errcode(ERRCODE_OBJECT_IN_USE),
                 errmsg("replication identifier %d is already active for PID %d",
                        curstate->roident, curstate->acquired_by)));
    }

which once again means that there can be only one process applying changes.

That's not true; all it means is that you can do replorigin_session_setup for the same origin only in one process, but you don't need to have it set up for the session to update it - replorigin_advance() works just fine.

But RecordTransactionCommit is using replorigin_session_advance, not replorigin_advance. And replorigin_session_advance requires that the session was set up:

    void
    replorigin_session_advance(XLogRecPtr remote_commit, XLogRecPtr local_commit)
    {
        Assert(session_replication_state != NULL);
        ...
    }

session_replication_state is a private variable which is set by replorigin_session_setup, but an attempt to call replorigin_session_setup from multiple processes causes the above error.

-- 
Konstantin Knizhnik
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company
Re: [HACKERS] Applying logical replication changes by more than one process
On Mar 21, 2016, at 4:30 PM, Petr Jelinek wrote:
> On 21/03/16 14:25, Andres Freund wrote:
>> On 2016-03-21 14:18:27 +0100, Petr Jelinek wrote:
>>> On 21/03/16 14:15, Andres Freund wrote:
>>>>> Only when the origin is actually setup for the current session. You need to call the replorigin_advance yourself from your apply code.
>>>>
>>>> That's problematic from a durability POV.
>>>
>>> Huh? How come?
>>
>> If you use the session mechanism the replication progress is synced with the apply process, even if there are crashes. Crash recovery updates the progress. There's no such interlock with apply otherwise, and I don't see how you can build one with reasonable effort.
>
> Ah you mean because with wal_log=true the origin advance is in a different WAL record than commit? OK yeah you might be one transaction behind then, true.

It actually means that we can not enforce database consistency. If we do replorigin_advance before the commit and a crash happens, then we will lose some changes. If we call replorigin_advance after the commit but a crash happens before it, then some changes can be applied multiple times - for example, we can insert some record twice (if there are no unique constraints). It looks like the only working scenario is to set up a replication session for each commit and use locking to prevent concurrent session setup for the same slot by multiple processes, doesn't it? I have tried it and, fortunately, it doesn't cause any noticeable performance degradation, but I can't consider such an approach elegant. (A sketch of this per-commit setup is at the end of this message.) Why is it actually necessary to bind a replication slot to a process? Why is it not possible to have multiple concurrent sessions for the same slot?

I am also concerned about the use of sequential search for slot location in replorigin_session_setup and many other functions - there is a loop through all max_replication_slots. It is not a problem when the number of slots is less than 10. For multimaster this assumption is true - even Oracle RAC rarely has a two-digit number of nodes. But if we want to perform sharding and use logical replication to provide redundancy, then the number of nodes and slots can be substantially larger. I haven't thought much about such a configuration - maybe it is possible to propose a more efficient replication mechanism for this case.

> 
> -- 
> Petr Jelinek                  http://www.2ndQuadrant.com/
> PostgreSQL Development, 24x7 Support, Training & Services
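Here is the sketch mentioned above of the per-commit origin session setup from a pool worker. It is only an outline under stated assumptions: apply_changes() and MyOriginLock are hypothetical placeholders, headers and error handling are omitted, and this is not how our multimaster actually implements it:

    /*
     * Outline of applying one remote transaction from a pool worker while
     * sharing a single replication origin between workers: the origin session
     * is acquired only around the commit, serialized by a (hypothetical) lock.
     * apply_changes() is a placeholder for replaying the decoded changes.
     */
    static void
    apply_remote_xact(RepOriginId origin_id,
                      XLogRecPtr remote_commit_lsn,
                      TimestampTz remote_commit_time,
                      List *changes)
    {
        StartTransactionCommand();

        apply_changes(changes);                 /* hypothetical helper */

        /* Serialize origin session setup across pool workers. */
        LWLockAcquire(MyOriginLock, LW_EXCLUSIVE);

        replorigin_session_setup(origin_id);
        replorigin_session_origin = origin_id;
        replorigin_session_origin_lsn = remote_commit_lsn;
        replorigin_session_origin_timestamp = remote_commit_time;

        /* The commit record now carries the origin LSN, and the slot is
         * advanced together with the commit by RecordTransactionCommit(). */
        CommitTransactionCommand();

        replorigin_session_reset();
        replorigin_session_origin = InvalidRepOriginId;

        LWLockRelease(MyOriginLock);
    }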
Re: [HACKERS] Applying logical replication changes by more than one process
On Mar 22, 2016, at 10:10 AM, Craig Ringer wrote:
> On 22 March 2016 at 14:32, konstantin knizhnik wrote:
>
>> Ah you mean because with wal_log=true the origin advance is in a different WAL record than commit? OK yeah you might be one transaction behind then, true.
>
> It actually means that we can not enforce database consistency. If we do replorigin_advance before commit and then a crash happens, we will lose some changes. If we call replorigin_advance after commit but a crash happens before it, then some changes can be applied multiple times. For example we can insert some record twice (if there are no unique constraints). It looks like the only working scenario is to set up a replication session for each commit and use locking to prevent concurrent session setup for the same slot by multiple processes, doesn't it?
>
> Yes.
>
> How would you expect it to work if you attempted to replorigin_advance without a session? From multiple concurrent backends?

It would not work. But I wonder why I would need to call replorigin_advance without a session. Please excuse me, I am not thinking about the general case of using logical replication now; I am just focused on multimaster. What I need is some efficient, durable and atomic mechanism for applying changes. I see only two ways to provide atomicity:
1. Tracking of origins is done inside the xact as part of the normal commit.
2. Use custom WAL records.
1) is easier now, and it really works if I correctly synchronize access to the slots. And surprisingly it doesn't even add substantial overhead.

> Parallel apply is complicated business. You have to make sure you apply xacts in an order that's free from deadlocks and from insert/delete anomalies - though you can at least detect those, ERROR that xact and all subsequent ones, and retry.

Well, this is exactly what our multimaster does. We do not try to enforce the order of applying xacts. But we detect global deadlocks and use 2PC to provide data consistency. So it is not a task of logical replication; it is done by the DTM, which overrides visibility checks and the transaction commit protocol using XTM.

> For progress tracking to be consistent and correct you'd have to make sure you committed strictly in the same order as upstream. Just before each commit you can set the origin LSN and advance the replication origin, which will commit atomically along with the commit it confirms. I don't really see the problem.

Sorry, I do not completely understand you. What do you mean by "will commit atomically along with the commit it confirms"? How will this atomicity be enforced?

> I have tried it, fortunately it doesn't cause any noticeable performance degradation. But unfortunately I can't consider such an approach elegant. Why is it actually necessary to bind a replication slot to a process? Why is it not possible to have multiple concurrent sessions for the same slot?
>
> Especially since most slot changes are LWLock- and/or spinlock-protected already.
>
> The client would have to manage replay confirmations appropriately so that it doesn't confirm past the point where some other connection still needs it.
>
> We'd have to expose a "slot" column in pg_stat_replication and remove the "pid" column from pg_replication_slots to handle the 1:n relationship between slot clients and slots, and it'd be a pain to show which normal user backends were using a slot. Not really sure how to handle that.
>
> To actually make this useful would require a lot more though.
A way to > request that replay start from a new LSN without a full disconnect/reconnect > each time. Client-side parallel consume/apply. Inter-transaction ordering > information so the client can work out a viable xact apply order (possibly > using SSI information per the discussion with Kevin?). Etc. > > I haven't really looked into this and I suspect there are some hairy areas > involved in replaying a slot from more than one client. The reason I'm > interested in it personally is for initial replica state setup as Oleksandr > prototyped and described earlier. We could attach to the slot's initial > snapshot then issue a new replication command that, given a table name or > oid, scans the table from the snapshot and passes each tuple to a new > callback (like, but not the same as, the insert callback) on the output > plugin. > > That way clients could parallel-copy the initial state of the DB across the > same replication protocol they then consume new changes from, with no need to > make normal libpq connections and COPY initial state. > > I'm interested in being able to do parallel receive of ne
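For reference, here is a minimal SQL-level sketch of option 1 mentioned earlier (recording origin progress inside the applying transaction), assuming the built-in replication origin functions; a real apply worker would use the equivalent C calls (replorigin_session_setup and friends). The node name and LSN are illustrative only:

SELECT pg_replication_origin_create('mm_node_2');        -- once per peer node
SELECT pg_replication_origin_session_setup('mm_node_2'); -- once per apply process
BEGIN;
-- ... apply the decoded changes here ...
SELECT pg_replication_origin_xact_setup('0/12345678', now());
COMMIT;  -- origin progress is written together with the commit record

Because the origin LSN is attached to the transaction itself, a crash either loses both the applied changes and the progress update or neither, which is exactly the atomicity discussed above.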
Re: [HACKERS] Applying logical replication changes by more than one process
On Mar 22, 2016, at 11:14 AM, Petr Jelinek wrote: > > And each slot means connection with logical decoding attached to it so you > don't really want to have thousands of those anyway. I think you'll hit other > problems faster than loop over slots becomes problem if you plan to keep all > of them active. Assume that a cluster has thousands of nodes and we use sharding to scatter data across the cluster nodes. But to provide HA we want to perform sharding with some level of redundancy, for example saving the same record at 3 different nodes. One possible approach (pg_shard) is to execute the same query at three different shards. But there is no guarantee that the result of execution will be the same at all nodes. An alternative approach is to execute the transaction at one node and then replicate it to the replicas using logical replication. So we do not perform logical replication to all 1000 nodes, just to 2 of them. But each time it will be a different pair of nodes, so we still need to have 1000 active replication slots. Maybe logical replication cannot be used at all in such a scenario - I have not thought much about it yet. Our first step will be multimaster without sharding. > > -- > Petr Jelinek http://www.2ndQuadrant.com/ > PostgreSQL Development, 24x7 Support, Training & Services > > > -- > Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) > To make changes to your subscription: > http://www.postgresql.org/mailpref/pgsql-hackers
[HACKERS] avg,first,last,median in one query
Hi, hackers. I need advice from SQL experts: is there any way in PostgreSQL to calculate the avg, first, last and median aggregates in one query? Assume that we have the following table: create table Securities ("Symbol" varchar, "Date" date, "Time" time, "Price" real); We can simulate median using percentile_disc: select "Symbol","Date", avg("Price"), percentile_disc(0.5) within group (order by "Price") from Securities group by "Symbol","Date"; And all the other aggregates can be calculated using window functions: select distinct "Symbol","Date", first_value("Price") over (partition by "Symbol","Date" order by "Time" rows between unbounded preceding and unbounded following), last_value("Price") over (partition by "Symbol","Date" order by "Time" rows between unbounded preceding and unbounded following), avg("Price") over (partition by "Symbol","Date" rows between unbounded preceding and unbounded following) from Securities; I wonder whether there is any simpler or more efficient alternative to the query above. Unfortunately it is not possible to calculate the median in the same way, because percentile_disc is not compatible with OVER: ERROR: OVER is not supported for ordered-set aggregate percentile_disc So is there any chance to calculate all these four aggregates in one query without writing some supplementary functions? Additional question: what is the most efficient way of calculating MEDIAN in PostgreSQL? I found three different approaches: 1. Using a CTE: https://www.periscopedata.com/blog/medians-in-sql.html 2. Using a user-defined aggregate function which uses array_append and so materializes all values in memory: https://wiki.postgresql.org/wiki/Aggregate_Median 3. Using a percentile aggregate: http://blog.jooq.org/2015/01/06/how-to-emulate-the-median-aggregate-function-using-inverse-distribution-functions/ Thanks in advance, -- Konstantin Knizhnik Postgres Professional: http://www.postgrespro.com The Russian Postgres Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
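One possible single-pass formulation (an untested sketch, not from the original thread): first and last can be emulated with ordered array_agg, which lets everything stay in one GROUP BY query together with the ordered-set aggregate:

select "Symbol", "Date",
       avg("Price")                                         as avg_price,
       (array_agg("Price" order by "Time"))[1]              as first_price,
       (array_agg("Price" order by "Time" desc))[1]         as last_price,
       percentile_disc(0.5) within group (order by "Price") as median_price
from Securities
group by "Symbol", "Date";

The ordered array_agg calls materialize each group in memory, so this is simpler rather than necessarily faster than the window-function variant.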
[HACKERS] Very small patch for decode.c
diff --git a/src/backend/replication/logical/decode.c b/src/backend/replication/logical/decode.c
index 2380ea2..a992662 100644
--- a/src/backend/replication/logical/decode.c
+++ b/src/backend/replication/logical/decode.c
@@ -488,7 +488,7 @@ DecodeCommit(LogicalDecodingContext *ctx, XLogRecordBuffer *buf,
 {
 	XLogRecPtr	origin_lsn = InvalidXLogRecPtr;
 	TimestampTz	commit_time = parsed->xact_time;
-	XLogRecPtr	origin_id = XLogRecGetOrigin(buf->record);
+	RepOriginId	origin_id = XLogRecGetOrigin(buf->record);
 	int			i;

-- Konstantin Knizhnik Postgres Professional: http://www.postgrespro.com The Russian Postgres Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
[HACKERS] Limit and inherited tables
Hi, I am sorry if this question was already discussed, but I failed to find any information about it in the archive. I noticed that the LIMIT clause is not pushed down to inherited tables. Consider the following tables: create table foo(x integer primary key); create table foo1 () inherits(foo); create table foo2 () inherits(foo); insert into foo1 values (generate_series(0,10)); insert into foo2 values (generate_series(0,10)); explain select * from foo order by x limit 1; QUERY PLAN Limit (cost=5.10..5.10 rows=1 width=4) -> Sort (cost=5.10..5.61 rows=20 width=4) Sort Key: foo.x -> Append (cost=0.00..4.06 rows=20 width=4) -> Seq Scan on foo (cost=0.00..0.00 rows=1 width=4) -> Seq Scan on foo1 (cost=0.00..2.03 rows=10 width=4) -> Seq Scan on foo2 (cost=0.00..2.03 rows=10 width=4) (7 rows) So Postgres has to merge two large data sets and sort the result, while the optimal plan is to take just one record from each inherited table, sort these few records and then limit the result (a manual rewrite illustrating this is sketched below). Such an optimization would be especially useful when using postgres_fdw - when inherited tables are located at remote nodes. Are there any plans to support this optimization, or maybe somebody is already working on it? Otherwise I can try to investigate it and propose an optimizer patch for it. -- Konstantin Knizhnik Postgres Professional: http://www.postgrespro.com The Russian Postgres Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
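A hedged sketch of the manual rewrite that such an optimization would effectively perform: take LIMIT 1 from each branch of the Append (including the parent table itself) and then sort and limit the tiny union:

select * from (
    (select * from only foo order by x limit 1)
    union all
    (select * from foo1 order by x limit 1)
    union all
    (select * from foo2 order by x limit 1)
) u
order by x
limit 1;

With the tables above this still seq-scans each child (there are no child indexes), but it shows the shape of the plan the optimizer could produce; with per-child indexes or remote child tables each branch becomes cheap.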
Re: [HACKERS] Limit and inherited tables
This example is lacking indexes on the child tables, which is why the plan shown is about as good as you're going to get. The contents of foo1 and foo2 have to be read in entirety in any case, and sorting them separately is not a win compared to doing a single sort. It is true, but not in the case of an FDW connected to a remote host. In this case sending large volumes of data through the network will be very inefficient. There will be no problem if the FDW can provide an index scan - in this case MergeAppend will fetch only the required number of records: postgres=# explain analyze select * from t order by u limit 1; QUERY PLAN --- Limit (cost=300.17..300.23 rows=1 width=8) (actual time=4.588..4.588 rows=1 loops=1) -> Merge Append (cost=300.17..762.76 rows=7681 width=8) (actual time=4.586..4.586 rows=1 loops=1) Sort Key: t.u -> Index Scan using t_pkey on t (cost=0.12..8.14 rows=1 width=8) (actual time=0.003..0.003 rows=0 loops=1) -> Foreign Scan on t_fdw1 (cost=100.00..193.92 rows=2560 width=8) (actual time=1.532..1.532 rows=1 loops=1) -> Foreign Scan on t_fdw2 (cost=100.00..193.92 rows=2560 width=8) (actual time=1.510..1.510 rows=1 loops=1) -> Foreign Scan on t_fdw3 (cost=100.00..193.92 rows=2560 width=8) (actual time=1.535..1.535 rows=1 loops=1) But if the sort is performed on non-indexed fields, then the current behaviour is inefficient and could be significantly improved by pushing the limit down to the remote hosts. -- Konstantin Knizhnik Postgres Professional: http://www.postgrespro.com The Russian Postgres Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
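For readers who want to reproduce this, a hedged sketch of the kind of setup behind the plan above (server names, connection options and remote table names are illustrative, not taken from the actual test):

CREATE EXTENSION postgres_fdw;
CREATE SERVER shard1 FOREIGN DATA WRAPPER postgres_fdw OPTIONS (host 'node1', dbname 'postgres');
CREATE USER MAPPING FOR CURRENT_USER SERVER shard1;
CREATE TABLE t (u integer PRIMARY KEY, v integer);
CREATE FOREIGN TABLE t_fdw1 () INHERITS (t) SERVER shard1 OPTIONS (table_name 't_part1');
-- t_fdw2 and t_fdw3 are created the same way against the other shards

With ORDER BY on the primary key, the MergeAppend shown above stops after one row per branch; with ORDER BY on a non-indexed column each Foreign Scan would pull the whole remote table unless the limit is pushed down.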
Re: [HACKERS] Optimizer questions
I am sorry for badly formatted query - I just cut&paste it from C++ client program. I have one more question to community regarding this patch. Proposed patch fix the problem particularly for SORT+LIMIT clauses. In this case evaluation of expressions which are not used in sort is alway waste of time. But I wonder if we should delay evaluation of complex expressions even if there is no LIMIT? Quite often client application doesn't fetch all query results even if there is no LIMIT clause. On 18.01.2016 05:47, Bruce Momjian wrote: On Tue, Jan 5, 2016 at 05:55:28PM +0300, konstantin knizhnik wrote: Hi hackers, I want to ask two questions about PostgreSQL optimizer. I have the following query: SELECT o.id as id,s.id as sid,o.owner,o.creator,o.parent_id as dir_id,s.mime_id,m.c_type,s.p_file,s.mtime,o.ctime,o.name ,o.title,o.size,o.deleted,la.otime,la.etime,uo.login as owner_login,uc.login as creator_login,(CASE WHEN f.user_id IS NULL THEN 0 ELSE 1 END) AS flagged, (select 'userid\\:'||string_agg(user_id,' userid\\:') from get_authorized_users (o.id)) as acl FROM objects s JOIN objects o ON s.original_file=o.id LEFT JOIN flags f ON o.id=f.obj_id AND o.owner=f.user_id LEFT JOIN objects_last_activity la ON o.id = la.obj_id AND o.owner = la.user_id, mime m, users uc , users uo WHERE (s.mime_id=904 or s.mime_id=908) AND m.mime_id = o.mime_id AND o.owner = uo.user_id AND o.creator = uc.user_id ORDER BY s.mtime LIMIT 9; FYI, I could not make any sense out of this query, and I frankly can't figure out how others can udnerstand it. :-O Anyway, I ran it through pgFormatter (https://github.com/darold/pgFormatter), which I am showing here because I was impressed with the results: SELECT o.id AS id, s.id AS sid, o.owner, o.creator, o.parent_id AS dir_id, s.mime_id, m.c_type, s.p_file, s.mtime, o.ctime, o.name, o.title, o.size, o.deleted, la.otime, la.etime, uo.login AS owner_login, uc.login AS creator_login, ( CASE WHEN f.user_id IS NULL THEN 0 ELSE 1 END ) AS flagged, ( SELECT 'userid\\:' || string_agg ( user_id, ' userid\\:' ) FROM get_authorized_users ( o.id ) ) AS acl FROM objects s JOIN objects o ON s.original_file = o.id LEFT JOIN flags f ON o.id = f.obj_id AND o.owner = f.user_id LEFT JOIN objects_last_activity la ON o.id = la.obj_id AND o.owner = la.user_id, mime m, users uc, users uo WHERE ( s.mime_id = 904 OR s.mime_id = 908 ) AND m.mime_id = o.mime_id AND o.owner = uo.user_id AND o.creator = uc.user_id ORDER BY s.mtime LIMIT 9; -- Konstantin Knizhnik Postgres Professional: http://www.postgrespro.com The Russian Postgres Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
[HACKERS] Batch update of indexes
Hi hackers, I want to know the opinion of the community about possible ways of solving a quite common problem: increasing insert speed while still providing indexes for efficient execution of queries. Many applications have to deal with a high input stream of data. Most of the time spent inserting a record into the database is taken by updating indexes. And without indexes we are not able to execute queries efficiently. So in many cases it is desirable to have "batch or concurrent" index updates. And it is acceptable that an index is slightly behind the current state of the table. One interesting approach to solving this problem is discussed in this article: https://mark.zealey.org/2016/01/08/how-we-tweaked-postgres-upsert-performance-to-be-2-3-faster-than-mongodb They are using materialized views to build indexes in the background. Interesting idea, but copying the content of the whole table just to be able to build an index concurrently seems to be overkill. I thought about more straightforward ways of solving this problem. It would be nice if we could preserve one of the main postulates of Postgres and other RDBMSes: indexes are just an optimization and the result of a query should not depend on the presence of indexes. The first idea is to use inheritance. I have investigated different ways of splitting a table into "archival" and "operational" parts, but all of them require physically copying data from one table to another. Another idea is to use partial indexes (http://www.postgresql.org/docs/current/static/indexes-partial.html) Assume that we have a stream of input data where each record has an increasing timestamp: create table t( ts timestamp primary key, c1 real, c2 integer, c3 varchar, ... cN char(5) ); We want to provide the highest insert speed for "t" but provide indexes for the c1..cN fields. We can declare partial indexes: create index idx1 on t(c1) where ts < '20/01/2016'; create index idx2 on t(c2) where ts < '20/01/2016'; ... create index idxN on t(cN) where ts < '20/01/2016'; Since these indexes do not cover the current date, they will not be affected by insert operations. But we can still efficiently run queries like select * from t where c1>100 and ts < '20/01/2016'; Then, in the background, maybe at night, we can do alter index idx1 where ts < '21/01/2016'; Please notice that such an alter index statement, changing the condition of a partial index, is not supported now. But I do not see any fundamental problems with supporting such a construction. We should just include in the index all records which match the new condition and do not match the old condition: ts < '21/01/2016' and not (ts < '20/01/2016') If there is an index on the "ts" field this can be done quite efficiently (see the sketch after this message). This approach doesn't contradict the concept of indexes in an RDBMS. But there is one more problem with this approach which I think should be addressed. Right now the optimizer builds the following execution plan for a query with a partial index: postgres=# explain select * from t where c1 < 10 and ts < '20/01/2016'::timestamp; QUERY PLAN --- Bitmap Heap Scan on t (cost=7.20..732.14 rows=12263 width=12) Recheck Cond: ((c1 < '10'::double precision) AND (ts < '2016-01-20 00:00:00'::timestamp without time zone)) -> Bitmap Index Scan on idx1 (cost=0.00..4.13 rows=12263 width=0) Index Cond: (c1 < '10'::double precision) (4 rows) As you can see, the optimizer inserts a recheck into the query execution plan while it is not needed in this case: the search condition is exactly the same as the partial index condition. 
The optimal plan should be: Index Scan using idx1 on t (cost=0.00..4.13 rows=12263 width=0) Index Cond: (c1 < '10'::double precision) What do you think about this approach? Will it be useful to work in this direction? Or are there better solutions for this problem? -- Konstantin Knizhnik Postgres Professional: http://www.postgrespro.com The Russian Postgres Company
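As a hedged illustration of the backfill step mentioned above (not code from any patch): the set of rows the proposed ALTER INDEX would have to add when the predicate is extended from ts < '20/01/2016' to ts < '21/01/2016' can be expressed as a plain range query, which is cheap if ts is indexed:

select ctid, c1
from t
where ts < '21/01/2016' and not (ts < '20/01/2016');

Each returned heap tuple would then be inserted into idx1, after which the stored index predicate is switched to the new condition.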
Re: [HACKERS] Batch update of indexes
Hi, Hi, I am glad to see that you are interested in that too. I think this is a good feature and I think it will be very useful to have. I have already mentioned some related problems and possible improvements in my presentation. http://www.slideshare.net/AnastasiaLubennikova/indexes-dont-mean-slow-inserts The last two slides concern this thread. Briefly, I've suggested thinking about an insertion buffer. Something very similar to it is already implemented in BRIN. It does not index the last data from the heap while the number of trailing pages is less than pages_per_range. Do you mean GIN-like usage of an insertion buffer (there it is called a "pending list")? So that we have to combine a search in the main tree with a search in the insert buffer? Actually this is what I want to avoid (because at least in the case of GIN the pending list causes significant performance degradation, while an up-to-date state of a full-text index is rarely required). A short example of GIN's pending-list option follows at the end of this message. The next point I've thought about is bulk update. The problem is that an update like "UPDATE mytable set a = a+1;" causes N searches from the root of the B-tree. It looks very strange to me, and I'd like to fix it somehow. The obvious solution is to update all tuples on the page at a time, and keep the number of the last updated page. But maybe it's a bit off-thread here. Bulk update is the second question (but a very important one). First I just want to be able to append to the index concurrently, without blocking inserts. One interesting approach to solving this problem is discussed in this article: https://mark.zealey.org/2016/01/08/how-we-tweaked-postgres-upsert-performance-to-be-2-3-faster-than-mongodb They are using materialized views to build indexes in the background. Interesting idea, but copying the content of the whole table just to be able to build an index concurrently seems to be overkill. This approach seems like a tricky crutch to me. And I agree that it requires a lot of extra work. It will be very interesting to know how people are using materialized views. Delayed building of indexes seems to be one of the popular use cases, although it requires large overhead, first of all storage overhead. Please notice that such an alter index statement, changing the condition of a partial index, is not supported now. Don't you think that this feature could be used in a very wrong way? Do not take it as criticism, just a bit of thought. Everything which can be misused will be misused:) But I do not worry much about it... If it can address real challenges, then it will be a good thing in any case. Ideally we should be able to alter everything. A naive implementation of such an alter clause could just build a new index with a temporary name, then drop the old index and rename the new one. There was a discussion of the patch for partial indexes. http://postgresql.nabble.com/PATCH-index-only-scans-with-partial-indexes-td5857568.html Since I haven't watched it closely, it seems to be open still. I think it'll be interesting to you. Such a small patch... Why was it not accepted? I do not see any problems with it... -- Anastasia Lubennikova Postgres Professional: http://www.postgrespro.com The Russian Postgres Company -- Konstantin Knizhnik Postgres Professional: http://www.postgrespro.com The Russian Postgres Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
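For reference, a hedged example of the existing deferred-insertion behaviour mentioned above: GIN's fastupdate storage parameter sends new entries to a pending list that is merged into the main tree later (by VACUUM or when the list overflows), which is exactly the trade-off being discussed. Table and column names are illustrative:

CREATE TABLE docs (id serial PRIMARY KEY, body text);
CREATE INDEX docs_fts_idx ON docs
    USING gin (to_tsvector('english', body))
    WITH (fastupdate = on);

Scans then have to read both the main tree and the pending list, which is where the performance degradation mentioned above comes from.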
Re: [HACKERS] Proposal for UPDATE: do not insert new tuple on heap if update does not change data
Hi, To eliminate the creation of a new tuple version in this case it is necessary to check that the update actually doesn't change the record. It is not the cheapest test, and it seems not to be a good idea to perform it always. But if you feel that in your case there are many "identical" updates, you can always explicitly rewrite the query by adding an extra check: UPDATE foo SET val = 'second' where pk = 2 and val <> 'second'; On 20.01.2016 12:55, Gasper Zejn wrote: Hi, I was wondering if PostgreSQL adds a new tuple if data is not changed when using UPDATE. It turns out it does add them and I think it might be beneficial not to add a new tuple in this case, since it causes a great deal of maintenance: updating indexes, vacuuming table and index, also heap fragmentation. How to check: CREATE TABLE foo (pk serial primary key, val text); -- Starting point: two rows. INSERT INTO foo VALUES (1, 'first'); INSERT INTO foo VALUES (2, 'second'); CHECKPOINT; -- Updating row with same value. UPDATE foo SET val = 'second' where pk = 2; CHECKPOINT; -- "Upsert" is the same. INSERT INTO foo VALUES (2, 'second') ON CONFLICT (pk) DO UPDATE SET val = 'second'; CHECKPOINT; If after any checkpoint you look at page data, you can see multiple versions of the same row with "second". Unfortunately, I don't believe I can come up with a patch on my own, but will happily offer any further help with testing and ideas. Attached is a script with a minimal test case. Kind regards, Gasper Zejn -- Konstantin Knizhnik Postgres Professional: http://www.postgrespro.com The Russian Postgres Company
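It may also be worth noting (as a hedged pointer, not something stated in the original thread) that PostgreSQL already ships a generic trigger for exactly this situation: suppress_redundant_updates_trigger() skips an UPDATE when the new row is byte-for-byte identical to the old one, at the cost of performing that comparison on every update of the table:

CREATE TRIGGER foo_suppress_redundant
    BEFORE UPDATE ON foo
    FOR EACH ROW
    EXECUTE PROCEDURE suppress_redundant_updates_trigger();

This gives the same effect as the explicit val <> 'second' check above, but without rewriting each query.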
Re: [HACKERS] Batch update of indexes
On Jan 21, 2016, at 5:14 AM, Simon Riggs wrote: > On 20 January 2016 at 14:55, Konstantin Knizhnik > wrote: > Hi, > > Hi, I am glad to see that you are interested in that too. > I think this is a good feature and I think it will be very useful to have. > I have already mentioned some related problems and possible improvements in > my presentation. > http://www.slideshare.net/AnastasiaLubennikova/indexes-dont-mean-slow-inserts > The last two slides concern this thread. Briefly, I've suggested thinking > about an insertion buffer. Something very similar to it is already implemented > in BRIN. It does not index the last data from the heap while the number of trailing > pages is less than pages_per_range. > > Do you mean GIN-like usage of an insertion buffer (there it is called a "pending > list")? > So that we have to combine a search in the main tree with a search in the insert buffer? > Actually this is what I want to avoid (because at least in the case of GIN > the pending list causes significant performance degradation, > while an up-to-date state of a full-text index is rarely required). > > The degradation in performance is because the scan of the pending list is O(N). > > If we did the same thing for monotonic inserts into a btree, the performance > of ruling out any contents in the pending list would be O(1), so it is more > feasible than you say. Sorry, I didn't get it. Certainly for B-Tree we can organize the insert buffer (or pending list) as a sorted array or as a tree. But in both cases the complexity of search in this buffer will be O(log(N)), not O(1). O(1) is possible only if we use a hash table for the pending list and are lucky with the hash function. But in this case it will not be possible to support range queries and proper ordering. In any case, the necessity to perform two searches instead of one and combine the results will significantly complicate the index implementation. But definitely this solution is more convenient and transparent for users... > > -- > Simon Riggs http://www.2ndQuadrant.com/ > PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Re: [HACKERS] Batch update of indexes
On 21.01.2016 10:14, Simon Riggs wrote: On 21 January 2016 at 06:41, konstantin knizhnik wrote: Certainly for B-Tree we can organize the insert buffer (or pending list) as a sorted array or as a tree. But in both cases the complexity of search in this buffer will be O(log(N)), not O(1). You are right; search within this buffer is O(log(N)). But we can test whether we need to search in the buffer at all with O(1). Only if records are inserted in key-ascending order... But usually we need to maintain more than one index, and for them it is not true. -- Simon Riggs http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services -- Konstantin Knizhnik Postgres Professional: http://www.postgrespro.com The Russian Postgres Company
Re: [HACKERS] Batch update of indexes
On 21.01.2016 19:09, Anastasia Lubennikova wrote: What I meant is more like a BRIN-like combination of an index scan and heap scan. Maybe it could be called "deferred inserts" or "temporary read-only index". Maybe it's similar to the MySQL insert buffer http://dev.mysql.com/doc/refman/5.7/en/innodb-insert-buffering.html I think it'll be more clear with an example. Please don't care about syntax. CREATE TABLE tbl (c1 int); CREATE INDEX idx on tbl(c1); SET enable_deferred_insert(idx) = on; At this moment, we save the last_indexed_item (its TID) somewhere in the index metapage. Since that moment, the data inserted into the table doesn't touch the index. We perform some heavy insert and then go back to the normal index behavior. SET enable_deferred_insert(idx) = off; This command takes all the data between the last_indexed_item and the end of the table, and inserts it into the index at a time. Of course there are new problems to deal with, but it's really useful for the use case of balancing an irregular heavy write load, isn't it? BTW, could you explain what the reason is to copy data into the pending list and then copy it again while flushing the pending list into the index? Why not read this data directly from the table? I feel that I've missed something important here. No, I do not think that inserted data should be placed in a pending list and then copied to the main table. It should be stored directly in the main table, and the "pending list" is just some fast, transient index. From my point of view there are two possibilities: 1. Preserve strict table-index consistency: query results should not depend on the presence of the index. 2. Support out-of-date or deferred indexes, which can be updated in the background. The second approach is certainly more efficient and IMHO acceptable for most users. But we would be violating one of the fundamental properties of RDBMSes... So I am not sure which approach to choose. The first case is also harder to implement, because we have to somehow merge two indexes during an index scan and provide proper recovery of the main index in case of failure (assuming that the pending list is maintained in memory and is lost after a crash). -- Konstantin Knizhnik Postgres Professional: http://www.postgrespro.com The Russian Postgres Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Batch update of indexes
Hi hackers, I have implemented "ALTER INDEX ... WHERE ..." clause allowing to change condition for partial index. Actually it allows us to append index without fully rebuilding it. As I explained in the previous mails, partial indexes can be used to increase insert speed. Right now I get the following results (with one insert stream): Insert with 1 index (primary key, monotonically ascending): 324275 TPS Insert with 9 indexes (primary key + 8 indexes with random keys): 52495 TPS Insert with primary key and 8 concurrently updated partial indexes: 194458 TPS Insert with primary key and 8 "frozen" partial indexes: 278446 TPS So, as you can see insert with indexes is about 6 times slower than insert without indexes. And partial indexes allows to eliminate this gap. When partial indexes are not affected (assuming that them will be reconstructed "at night"), performance is almost the same, as without indexes. And if "ALTER INDEX" is done concurrently with inserts, it certainly decrease insert speed, but still it is 4 times faster than with normal indexes. Such high TPS values were obtained using "insert from select" to bypass libpq overhead. With libpq (when each insert is sent as independent statement) results are less impressive: Insert with 1 index (primary key, monotonically ascending): 37892 TPS Insert with 9 indexes (primary key + 8 indexes with random keys): 20231 TPS Insert with primary key and 8 concurrently updated partial indexes: 26934 TPS Insert with primary key and 8 "frozen" partial indexes: 28863 TPS But still partial indexes allows to almost eliminate two times differences... This results can be reproduced using our public repository: https://github.com/postgrespro/postgres_cluster Most of the code related with support of "ALTER INDEX .. WHERE" is in AlterIndex function in postgres_cluster/src/backend/commands/indexcmds.c I have also added insbench utility for measuring insert performance, using which this results were obtained. It is located in postgres_cluster/src/bin/insbench directory. Known issues: 1. I do not handle case when new condition for partial index is more restricted than original. There is no way in Postgres to exclude records from index (except VACUUM), so in this case index has to be reconstructed from scratch. 2. Currently I am using SPI to locate records which should be included in index. 3. I am not completely sure that there are no synchronization/isolation problems in AlterIndex function If this approach is considered to be interesting by community, I will try to address these issues. On 20.01.2016 12:28, Konstantin Knizhnik wrote: Hi hackers, I want to know opinion of community about possible ways of solving quite common problem: increasing insert speed while still providing indexes for efficient execution of queries. Many applications have to deal with high input stream of data. Most of the time while record inserting in the database is taken for update of indexes. And without indexes we are not able to efficiently execute queries. So in many cases it is desirable to have "batch or concurrent" index update. And it is acceptable that an index is slightly behind current state of the table. One interesting approach of solving this problem is discussed in this article: https://mark.zealey.org/2016/01/08/how-we-tweaked-postgres-upsert-performance-to-be-2-3-faster-than-mongodb Them are using materialized views to build indexes in background. 
Interesting idea, but copying content of the whole table just to be able to build index concurrently seems to be overkill. I thought about more straightforward ways of solving this problem. It will be nice if we can preserve of of them main postulates of Postgres and other RDBMSes: indexes are just optimization and result of query should not depend on presence of indexes. First idea is to use inheritance. I have investigated different ways of splitting table into "archival" and "operational" parts, but all of them requiring physical copying of data from one table to another. Another idea is to use partial indexes (http://www.postgresql.org/docs/current/static/indexes-partial.html) Assume that we have stream of input data where each record have increased timestamp: create table t( ts timestamp primary key, c1 real, c2 integer, c3 varchar, ... cN char(5) ); We want to provide the highest insert speed for "t" but provide indexes for c1..cN fields. We can declared partial indexes: create index idx1 on t(c1) where ts < '20/01/2016'; create index idx2 on t(c2) where ts < '20/01/2016'; ... create index idxN on t(cN) where ts < '20/01/2016'; As far as this indexes do not cover current d
Re: [HACKERS] Optimizer questions
Unfortunately these two statements are not equivalent: the second one can (in theory, although not for this particular data set) return more than 10 result records. Such an optimization would be correct if t2.i were declared unique. But the most efficient plan for this query is generated if there is an index on t1.v. In this case no explicit sort is needed. The limit is still not pushed down, but it is not a problem because a nested loop join is used, which does not materialize its result (it produces records on demand): # explain analyze select * from t1 left outer join t2 on t1.k=t2.k order by t1.v limit 10; QUERY PLAN -- Limit (cost=0.58..4.38 rows=10 width=16) (actual time=0.069..0.157 rows=10 loops=1) -> Nested Loop Left Join (cost=0.58..37926.63 rows=11 width=16) (actual time=0.067..0.154 rows=10 loops=1) -> Index Scan using idxv on t1 (cost=0.29..3050.31 rows=11 width=8) (actual time=0.046..0.053 rows=10 loops=1) -> Index Scan using t2_pkey on t2 (cost=0.29..0.34 rows=1 width=8) (actual time=0.007..0.007 rows=1 loops=10) Index Cond: (t1.k = k) Planning time: 0.537 ms Execution time: 0.241 ms (7 rows) On 01/30/2016 01:01 AM, Alexander Korotkov wrote: On Fri, Jan 8, 2016 at 11:58 AM, Konstantin Knizhnik wrote: Attached please find an improved version of the optimizer patch for the LIMIT clause. Now I try to apply this optimization only for non-trivial columns requiring evaluation. Maybe it would be better to take into account the cost of evaluating these columns, but right now I just detect non-variable columns. We may think about a more general feature: LIMIT pushdown. In Konstantin's patch the planner pushes LIMIT before tlist calculation. But there are other cases when calculating LIMIT first would be beneficial. For instance, we can do LIMIT before JOINs. That is possible only for LEFT JOIN which is not used in filter and ordering clauses. See the example below. 
# create table t1 as (select i, random() v from generate_series(1,100) i); SELECT 100 # create table t2 as (select i, random() v from generate_series(1,100) i); SELECT 100 # explain analyze select * from t1 left join t2 on t1.i = t2.i order by t1.v limit 10; QUERY PLAN Limit (cost=87421.64..87421.67 rows=10 width=24) (actual time=1486.276..1486.278 rows=10 loops=1) -> Sort (cost=87421.64..89921.64 rows=100 width=24) (actual time=1486.275..1486.275 rows=10 loops=1) Sort Key: t1.v Sort Method: top-N heapsort Memory: 25kB -> Hash Left Join (cost=27906.00..65812.00 rows=100 width=24) (actual time=226.180..1366.238 rows=100 Hash Cond: (t1.i = t2.i) -> Seq Scan on t1 (cost=0.00..15406.00 rows=100 width=12) (actual time=0.010..77.041 rows=100 l -> Hash (cost=15406.00..15406.00 rows=100 width=12) (actual time=226.066..226.066 rows=100 loop Buckets: 131072 Batches: 1 Memory Usage: 46875kB -> Seq Scan on t2 (cost=0.00..15406.00 rows=100 width=12) (actual time=0.007..89.002 rows=100 Planning time: 0.123 ms Execution time: 1492.118 ms (12 rows) # explain analyze select * from (select * from t1 order by t1.v limit 10) t1 left join t2 on t1.i = t2.i; QUERY PLAN Hash Right Join (cost=37015.89..56171.99 rows=10 width=24) (actual time=198.478..301.278 rows=10 loops=1) Hash Cond: (t2.i = t1.i) -> Seq Scan on t2 (cost=0.00..15406.00 rows=100 width=12) (actual time=0.005..74.303 rows=100 loops=1) -> Hash (cost=37015.77..37015.77 rows=10 width=12) (actual time=153.584..153.584 rows=10 loops=1) Buckets: 1024 Batches: 1 Memory Usage: 1kB -> Limit (cost=37015.64..37015.67 rows=10 width=12) (actual time=153.579..153.580 rows=10 loops=1) -> Sort (cost=37015.64..39515.64 rows=100 width=12) (actual time=153.578..153.578 rows=10 loops=1) Sort Key: t1.v Sort Method: top-N heapsort Memory: 25kB -> Seq Scan on t1 (cost=0.00..15406.00 rows=100 width=12) (actual time=0.012..78.828 rows=100 Planning time: 0.132 ms Execution time: 301.308 ms (12 rows) In this example LIMIT pushdown makes query 5 times faster. It would be very nice if optimizer make this automatically. -- Alexander Korotkov Postgres Professional: http://www.postgrespro.com <http://www.postgrespro.com/> The Russian Postgres Company -- Konstantin Knizhnik Postgres Profe
Re: [HACKERS] PATCH: index-only scans with partial indexes
I am very interested in this patch because it allows partial indexes to be used to ... speed up inserts. I have implemented an "ALTER INDEX ... WHERE ..." construction which allows the predicate of a partial index to be changed without fully rebuilding it. So it is not necessary to insert new records into the index immediately (if the new records do not match the partial index condition). It can be done later in the background (or at night). My experiments show that it allows insert speed to be increased five times (for eight partial indexes). At the same time we do not lose the RDBMS property that the result of a query should not depend on the presence of indexes. And it is applicable to all indexes: B-Tree, GIN, GIST,... But such an optimization makes sense only if partial indexes can be used without extra overhead, first of all for index-only scans. And that is impossible without this patch. On 01/31/2016 03:34 PM, Alvaro Herrera wrote: Tomas Vondra wrote: On 12/24/2015 04:05 AM, Michael Paquier wrote: Tomas, are you still working on that? This thread is stalling for 3 weeks. I haven't discovered anything interesting during the testing, so I guess the "needs review" state is appropriate. Let's move the patch to the next commitfest. Not sure what to do here, since this patch got no feedback at all in this CF. The right thing to do, ISTM, is to just move it again to the next CF. But it'd be really useful if someone can have a look at it and verify at least whether it doesn't need a rebase (requiring a further submission) so that other people can play with it. Of course, if Horiguchi-san or anyone has more review comments, that would be even better. Tomas said he'd do more testing, but we never got a report on whether anything turned up. (At this point I'm not sure if either Kyotaro or Tomas should be considered the patch author ... maybe both?) -- Konstantin Knizhnik Postgres Professional: http://www.postgrespro.com The Russian Postgres Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] PATCH: index-only scans with partial indexes
I have applied this patch to our working branch, and during several weeks we ran various tests and benchmarks. We have not noticed any problems or performance degradation. And for some queries this patch causes a very significant increase in performance - ten times: With this patch: postgres=# explain analyze select count(*) from t where k1<100 and pk < 1454434742881892; QUERY PLAN Aggregate (cost=29.65..29.66 rows=1 width=0) (actual time=0.108..0.108 rows=1 loops=1) -> Index Only Scan using idx1 on t (cost=0.43..27.49 rows=861 width=0) (actual time=0.012..0.071 rows=963 loops=1) Index Cond: (k1 < 100) Heap Fetches: 0 Planning time: 0.100 ms Execution time: 0.121 ms (6 rows) Without patch: postgres=# explain analyze select count(*) from t where k1<100 and pk < 1454434742881892; QUERY PLAN Aggregate (cost=2951.55..2951.56 rows=1 width=0) (actual time=1.070..1.070 rows=1 loops=1) -> Bitmap Heap Scan on t (cost=19.10..2949.40 rows=861 width=0) (actual time=0.161..0.997 rows=963 loops=1) Recheck Cond: ((k1 < 100) AND (pk < '1454434742881892'::bigint)) Heap Blocks: exact=954 -> Bitmap Index Scan on idx1 (cost=0.00..18.88 rows=861 width=0) (actual time=0.083..0.083 rows=963 loops=1) Index Cond: (k1 < 100) Planning time: 0.099 ms Execution time: 1.089 ms (8 rows) On 01.02.2016 01:11, Alvaro Herrera wrote: Konstantin Knizhnik wrote: I am very interested in this patch because it allows partial indexes to be used to ... speed up inserts. I have implemented an "ALTER INDEX ... WHERE ..." construction which allows the predicate of a partial index to be changed without fully rebuilding it. So it is not necessary to insert new records into the index immediately (if the new records do not match the partial index condition). It can be done later in the background (or at night). My experiments show that it allows insert speed to be increased five times (for eight partial indexes). At the same time we do not lose the RDBMS property that the result of a query should not depend on the presence of indexes. And it is applicable to all indexes: B-Tree, GIN, GIST,... But such an optimization makes sense only if partial indexes can be used without extra overhead, first of all for index-only scans. And that is impossible without this patch. That sounds interesting. So please review this patch and let us know whether you like it, or whether you have any better ideas for any particular hunk, or whether you think it should be rewritten from scratch, or just state that it is perfect. If you think it's useful, then it's a good idea to vouch for it to be integrated; and one way of doing that is making sure it meets the quality standards etc. If you don't say anything about the patch, we may never integrate it because we might have doubts about whether it's worthy. -- Konstantin Knizhnik Postgres Professional: http://www.postgrespro.com The Russian Postgres Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
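As a side note, here is a hedged reconstruction of the test schema, guessed from the plan output above rather than taken from the actual test script: idx1 appears to be a partial index whose predicate implies the pk condition of the query, which is exactly the case where this patch allows an index-only scan without a recheck:

create table t (pk bigint primary key, k1 integer);
create index idx1 on t (k1) where pk < 1454434742881892;

With that shape of index, the pk < 1454434742881892 clause is proven by the index predicate, so only k1 < 100 remains as the index condition in the plan above.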
Re: [HACKERS] Batch update of indexes
Attached please find patch for "ALTER INDEX ... WHERE ..." clause. It is now able to handle all three possible situations: 1. Making index partial (add WHERE condition to the ordinary index) 2. Extend partial index range (less restricted index predicate) 3. Arbitrary change of partial index predicate In case 2) new records are added to the index. In other two cases index is completely reconstructed. This patch includes src/bin/insbench utility for testing insert performance. It can be easily excluded from the patch to reduce it size. Also it is better to apply this patch together with "index-only scans with partial indexes" patch: http://www.postgresql.org/message-id/560c7213.3010...@2ndquadrant.com only in this case regression test will produce expected output. On 27.01.2016 23:15, Robert Haas wrote: On Wed, Jan 20, 2016 at 4:28 AM, Konstantin Knizhnik wrote: Please notice that such alter table statement, changing condition for partial index, is not supported now. But I do not see any principle problems with supporting such construction. We should just include in the index all records which match new condition and do not match old condition: ts < '21/01/2016' and not (ts < '20/01/2016') You'd also need to remove any rows from the index that match the old condition but not the new one. In your example, that's impossible, but in general, it's definitely possible. -- Konstantin Knizhnik Postgres Professional: http://www.postgrespro.com The Russian Postgres Company diff --git a/src/backend/commands/indexcmds.c b/src/backend/commands/indexcmds.c index b450bcf..b6ffb19 100644 --- a/src/backend/commands/indexcmds.c +++ b/src/backend/commands/indexcmds.c @@ -32,6 +32,7 @@ #include "commands/tablespace.h" #include "mb/pg_wchar.h" #include "miscadmin.h" +#include "funcapi.h" #include "nodes/nodeFuncs.h" #include "optimizer/clauses.h" #include "optimizer/planner.h" @@ -50,6 +51,9 @@ #include "utils/snapmgr.h" #include "utils/syscache.h" #include "utils/tqual.h" +#include "utils/ruleutils.h" +#include "executor/executor.h" +#include "executor/spi.h" /* non-export function prototypes */ @@ -275,6 +279,160 @@ CheckIndexCompatible(Oid oldId, return ret; } +static void +UpdateIndex(Oid indexRelationId, Node* whereClause) +{ + Datum values[Natts_pg_index]; + bool isnull[Natts_pg_index]; + HeapTuple oldTuple; + HeapTuple newTuple; + Relation pg_index; + + pg_index = heap_open(IndexRelationId, RowExclusiveLock); + oldTuple = SearchSysCacheCopy1(INDEXRELID, ObjectIdGetDatum(indexRelationId)); + if (!HeapTupleIsValid(oldTuple)) + elog(ERROR, "cache lookup failed for index %u", indexRelationId); + + heap_deform_tuple(oldTuple, RelationGetDescr(pg_index), values, isnull); + values[Anum_pg_index_indpred - 1] = CStringGetTextDatum(nodeToString(whereClause)); + isnull[Anum_pg_index_indpred - 1] = false; + newTuple = heap_form_tuple(RelationGetDescr(pg_index), values, isnull); + simple_heap_update(pg_index, &oldTuple->t_self, newTuple); + CatalogUpdateIndexes(pg_index, newTuple); + heap_freetuple(newTuple); + heap_freetuple(oldTuple); + heap_close(pg_index, NoLock); +} + +void +AlterIndex(Oid indexRelationId, IndexStmt *stmt) +{ + char* select; + Oid heapRelationId; + IndexUniqueCheck checkUnique; + Datum values[INDEX_MAX_KEYS]; + bool isnull[INDEX_MAX_KEYS]; + Relation heapRelation; + Relation indexRelation; +SPIPlanPtr plan; +Portal portal; + HeapTuple tuple; + TupleTableSlot *slot; + ItemPointer tupleid; + IndexInfo *indexInfo; + EState *estate; + Oid namespaceId; + List* deparseCtx; + char* oldIndexPredicate; + 
char* newIndexPredicate; + char* relationName; + + Assert(stmt->whereClause); + CheckPredicate((Expr *) stmt->whereClause); + + /* Open and lock the parent heap relation */ + heapRelationId = IndexGetRelation(indexRelationId, false); + heapRelation = heap_open(heapRelationId, AccessShareLock); + + /* Open the target index relation */ + /* indexRelation = index_open(indexRelationId, RowExclusiveLock); */ + indexRelation = index_open(indexRelationId, ShareUpdateExclusiveLock); + /* indexRelation = index_open(indexRelationId, AccessShareLock); */ + namespaceId = RelationGetNamespace(indexRelation); + + indexInfo = BuildIndexInfo(indexRelation); + Assert(!indexInfo->ii_ExclusionOps); + + /* + * Generate the constraint and default execution states + */ + estate = CreateExecutorState(); + + checkUnique = indexRelation->rd_index->indisunique ? UNIQUE_CHECK_YES : UNIQUE_CHECK_NO; + + slot = MakeSingleTupleTableSlot(RelationGetDescr(heapRelation)); + + deparseCtx = deparse_context_for(RelationGetRelationName(heapRelation), heapRelationId); + relationName = quote_qualified_identifier(get_namespace_name(namespaceId), +
Re: [HACKERS] Batch update of indexes
On Feb 4, 2016, at 2:00 AM, Jim Nasby wrote: > > My suspicion is that it would be useful to pre-order the new data before > trying to apply it to the indexes. Sorry, but ALTER INDEX is expected to work for all indexes, not only B-Tree, and for some of them sorting may not be possible... But for B-Tree, presorting the inserted data should certainly increase performance. I will think about it. > -- > Jim Nasby, Data Architect, Blue Treble Consulting, Austin TX > Experts in Analytics, Data Architecture and PostgreSQL > Data in Trouble? Get it in Treble! http://BlueTreble.com > > > -- > Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) > To make changes to your subscription: > http://www.postgresql.org/mailpref/pgsql-hackers -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
[HACKERS] eXtensible Transaction Manager API (v2)
Hi, The Postgres Professional cluster team wants to propose a new version of the eXtensible Transaction Manager API. The previous discussion concerning this patch can be found here: http://www.postgresql.org/message-id/f2766b97-555d-424f-b29f-e0ca0f6d1...@postgrespro.ru The API patch itself is small enough, but we think that it would be strange to provide just the API without examples of its usage. We have implemented various distributed transaction managers based on this API: pg_dtm (based on snapshot sharing) and pg_tsdtm (CSN based on local system time). Based on these two DTM implementations we have developed various "cluster" implementations: multimaster+pg_dtm, multimaster+pg_tsdtm, pg_shard+pg_dtm, pg_shard+pg_tsdtm, postgres_fdw+pg_dtm, postgres_fdw+pg_tsdtm,... Multimaster is based on logical replication and is something like BDR, but synchronous: it provides consistency across the cluster. But we want to make this patch as small as possible, so we decided to include in it only pg_tsdtm and a patch of postgres_fdw allowing it to be used with pg_tsdtm. pg_tsdtm is simpler than pg_dtm because the latter includes an arbiter with the RAFT protocol (a centralized service) and sockhub for efficient multiplexing of backend connections. Also, in theory, pg_tsdtm provides better scalability, because it is decentralized. The architecture of DTM and tsDTM as well as benchmark results are available at the WiKi page: https://wiki.postgresql.org/wiki/DTM Please notice that pg_tsdtm is just a reference implementation of a DTM using this XTM API. The primary idea of this patch is to add the XTM API to the PostgreSQL code, making it possible to implement custom transaction managers as Postgres extensions. So please review first of all the XTM API itself and not pg_tsdtm, which is just an example of its usage. The complete PostgreSQL branch with all our changes can be found here: https://github.com/postgrespro/postgres_cluster -- Konstantin Knizhnik Postgres Professional: http://www.postgrespro.com The Russian Postgres Company diff --git a/contrib/pg_tsdtm/Makefile b/contrib/pg_tsdtm/Makefile new file mode 100644 index 000..e70dffc --- /dev/null +++ b/contrib/pg_tsdtm/Makefile @@ -0,0 +1,20 @@ +MODULE_big = pg_tsdtm +OBJS = pg_tsdtm.o + +EXTENSION = pg_tsdtm +DATA = pg_tsdtm--1.0.sql + +ifdef USE_PGXS +PG_CONFIG = pg_config +PGXS := $(shell $(PG_CONFIG) --pgxs) +include $(PGXS) +else +subdir = contrib/pg_tsdtm +top_builddir = ../.. 
+include $(top_builddir)/src/Makefile.global +include $(top_srcdir)/contrib/contrib-global.mk +endif + +check: + env DESTDIR='$(abs_top_builddir)'/tmp_install make install + $(prove_check) diff --git a/contrib/pg_tsdtm/dtm_recovery/dtm_recovery.cpp b/contrib/pg_tsdtm/dtm_recovery/dtm_recovery.cpp new file mode 100644 index 000..38285be --- /dev/null +++ b/contrib/pg_tsdtm/dtm_recovery/dtm_recovery.cpp @@ -0,0 +1,129 @@ +#include +#include +#include +#include + +#include +#include +#include + +using namespace std; +using namespace pqxx; + +int main (int argc, char* argv[]) +{ +if (argc == 1){ +printf("Use -h to show usage options\n"); +return 1; +} +vector connections; +set prepared_xacts; +set committed_xacts; +bool verbose = false; +for (int i = 1; i < argc; i++) { +if (argv[i][0] == '-') { +switch (argv[i][1]) { + case 'C': + case 'c': +connections.push_back(string(argv[++i])); +continue; + case 'v': +verbose = true; +continue; +} +} +printf("Perform recovery of pg_tsdtm cluster.\n" + "Usage: dtm_recovery {options}\n" + "Options:\n" + "\t-c STR\tdatabase connection string\n" + "\t-v\tverbose mode: print extra information while processing\n"); +return 1; +} +if (verbose) { +cout << "Collecting information about prepared transactions...\n"; +} +for (vector::iterator ic = connections.begin(); ic != connections.end(); ++ic) +{ +if (verbose) { +cout << "Connecting to " << *ic << "...\n"; +} +connection con(*ic); +work txn(con); +result r = txn.exec("select gid from pg_prepared_xacts"); +for (result::const_iterator it = r.begin(); it != r.end(); ++it) +{ +string gid = it.at("gid").as(string()); +prepared_xacts.insert(gid); +} +txn.commit(); +} +if (verbose) { +cout << "Prepared transactions: "; +for (set::iterator it = prepared_xacts.begin(); it != prepared_xacts.end(); ++it) +{ +cout << *it << ", "; +} +
[HACKERS] Clock with Adaptive Replacement
Hi hackers, What do you think about improving the cache replacement clock-sweep algorithm in PostgreSQL with the adaptive version proposed in this article: http://www-cs.stanford.edu/~sbansal/pubs/fast04.pdf Are there some well-known drawbacks of this approach, or would it be interesting to adapt this algorithm to PostgreSQL and measure its impact on performance under different workloads? I found this ten-year-old thread: http://www.postgresql.org/message-id/flat/d2jkde$6bg$1...@sea.gmane.org#d2jkde$6bg$1...@sea.gmane.org but it mostly discusses possible patent issues with another algorithm, ARC (CAR is inspired by ARC, but it is a different algorithm). As far as I know there are several problems with the current clock-sweep algorithm in PostgreSQL, especially for very large caches. Maybe CAR can address some of them? -- Konstantin Knizhnik Postgres Professional: http://www.postgrespro.com The Russian Postgres Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Clock with Adaptive Replacement
Thank you very much for the response. I am not sure that CART can significantly improve PostgreSQL performance - I just want to know the opinion of the community about CAR/CART and other possible alternatives to the GCLOCK algorithm. It looks like CAR really provides a better cache hit ratio and so should increase Postgres performance on some workloads. But nowadays the amount of memory in servers is large enough to keep most typical databases completely in cache, so the time to locate a buffer in the cache is more critical than the time of buffer eviction. And here CART doesn't provide any benefits compared with the GCLOCK algorithm. One of the problems with the GCLOCK algorithm, from my point of view, is that for large caches, containing a large number of pages, locating a victim page can take a substantial amount of time, because we may have to perform several sweeps before some usage count reaches zero. In theory CAR can address this problem because there are no counters - just a single bit per page. On 12.02.2016 18:55, Robert Haas wrote: On Thu, Feb 11, 2016 at 4:02 PM, Konstantin Knizhnik wrote: What do you think about improving the cache replacement clock-sweep algorithm in PostgreSQL with the adaptive version proposed in this article: http://www-cs.stanford.edu/~sbansal/pubs/fast04.pdf Are there some well-known drawbacks of this approach, or would it be interesting to adapt this algorithm to PostgreSQL and measure its impact on performance under different workloads? I found this ten-year-old thread: http://www.postgresql.org/message-id/flat/d2jkde$6bg$1...@sea.gmane.org#d2jkde$6bg$1...@sea.gmane.org but it mostly discusses possible patent issues with another algorithm, ARC (CAR is inspired by ARC, but it is a different algorithm). As far as I know there are several problems with the current clock-sweep algorithm in PostgreSQL, especially for very large caches. Maybe CAR can address some of them? Maybe, but the proof of the pudding is in the eating. Just because an algorithm is smarter, newer, and better in general than our current algorithm - and really, it wouldn't be hard - doesn't mean that it will actually solve the problems we care about. A few of my EnterpriseDB colleagues spent a lot of time benchmarking various tweaks to our current algorithm last year and were unable to construct a test case where it sped anything up. If they tried the same tweaks against the 9.4 source base, they could get a speedup. But 9.5 had locking improvements around buffer eviction, and with those improvements committed there was no longer any measurable benefit to improving the quality of buffer eviction decisions. That's a surprising result, to me anyway, and somebody else might well find a test case where a benefit can be shown - but our research was not successful. I think it's important to spend time and energy figuring out exactly what the problems with our current algorithm are. We know in general terms that usage counts tend to converge to either 5 or 0 and therefore sometimes evict buffers both at great cost and almost randomly. But what's a lot less clear is how much that actually hurts us given that we are relying on the OS cache anyway. It may be that we need to fix some other things before or after improving the buffer eviction algorithm before we actually get a performance benefit. I suspect, for example, that a lot of the problems with large shared_buffers settings have to do with the bgwriter and checkpointer behavior rather than with the buffer eviction algorithm; and that others have to do with cache duplication between PostgreSQL and the operating system. 
So, I would suggest (although of course it's up to you) that you might want to focus on experiments that will help you understand where the problems are before you plunge into writing code to fix them. -- Konstantin Knizhnik Postgres Professional: http://www.postgrespro.com The Russian Postgres Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] pglogical - logical replication contrib module
s some rules like "don't create FKs pointing from non-replicated tables to tables replicated from somewhere else". A concept we currently have no way to express or enforce like we do persistent-to-UNLOGGED FKs. Then there's global objects. Something as simple as: CREATE ROLE fred; CREATE TABLE blah(...) OWNER fred; will break replication because we only see the CREATE TABLE, not the CREATE ROLE. If we instead replayed the CREATE ROLE and there were multiple connections between different DBs on an upstream and downstream apply would fail on all but one. But we can't anyway since there's no way to capture that CREATE ROLE from any DB except the one it was executed in, which might not even be one of the ones doing replication. I strongly suspect we'll need logical decoding to be made aware of such global DDL and decode it from the WAL writes to the system catalogs. Which will be fun - but at least modifications to the shared catalogs are a lot simpler than the sort of gymnastics done by ALTER TABLE, etc. -- Craig Ringer http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services -- Konstantin Knizhnik Postgres Professional: http://www.postgrespro.com The Russian Postgres Company
Re: [HACKERS] pglogical - logical replication contrib module
Ok, what about the following plan: 1. Support custom WAL records (as far as I know 2ndQuadrant has such a patch). 2. Add one more function to logical decoding, allowing custom records to be handled. So the idea is that we somehow record DDL in the WAL (for example using an executor hook), and then these records are processed by logical decoding, calling a special logical decoding plugin function to handle them. For example we can store DDL in the WAL just as SQL statements and so easily replay them. In this case DDL will be replicated using the same mechanism and through the same channel as DML. On 17.02.2016 12:16, Craig Ringer wrote: On 17 February 2016 at 16:24, Konstantin Knizhnik wrote: Thanks for your explanation. I have to agree with your arguments that in the general case replication of DDL statements using logical decoding seems to be problematic. But we are mostly considering logical decoding in a quite limited context: replication between two identical Postgres database nodes (multimaster). Yep, much like BDR. Where all this infrastructure came from and is/was aimed at. Do you think that in this case replication of DDL can be done as a sequence of low-level operations with system catalog tables, including manipulation with locks? No. For one thing logical decoding doesn't see catalog tuple changes right now. Though I imagine that could be changed easily enough. More importantly - oids. You add a column to a table: ALTER TABLE mytable ADD COLUMN mycolumn some_type UNIQUE NOT NULL DEFAULT some_function() This writes to catalogs including: pg_attribute pg_constraint pg_index pg_class (for the index relation) ... probably more. It also refers to pg_class (for the definition of mytable), pg_type (definition of some_type), pg_proc (definition of some_function), the b-tree operator class for some_type in pg_opclass, the b-tree indexam in pg_am, ... more. Everything is linked by oids, and the oids are all node local. You can't just blindly re-use them. If "some_type" is hstore, the oid of hstore in pg_type might be different on the upstream and downstream. The only exception is the oids of built-in types and even then that's not guaranteed across major versions. So if you blindly replicate catalog row changes you'll get a horrible mess. That's before considering a table's relfilenode, which is initially the same as its oid, but subject to change if truncated or rewritten. To even begin to do this half-sanely you'd have to maintain a mapping of upstream object oids->names on the downstream, with invalidations replicated from the upstream. That's only the beginning. There's handling of extensions and lots more fun. So in your example with the ALTER TABLE statement, can we correctly replicate it to other nodes as a request to set an exclusive lock plus some manipulations with catalog tables and the data table itself? Nope. No hope, not unless "some manipulations with catalog tables and the data table itself" is a lot more comprehensive than I think you mean. 1. Add an option whether to include operations on system catalog tables in logical replication or not. I would like to have this anyway. 2. Make it possible to replicate lock requests (can be useful not only for DDL). I have no idea how you'd even begin to do that. I looked at how DDL is implemented in BDR and did it in a similar way in our multimaster. But it is awful: we need to have two different channels for propagating changes. Yeah, it's not beautiful, but maybe you misunderstood something? 
The DDL is written to a table, and that table's changes are replayed along with everything else. It's consistent and keeps DDL changes as part of the xact that performed them. Maybe you misunderstood how it works in BDR and missed the indirection via a table? Additionally, in multimaster we want to enforce cluster-wide ACID. It certainly includes operations on metadata. It will be very difficult to implement if replication of DML and DDL is done in two different ways... That's pretty much why BDR does it this way, warts and all. Though it doesn't offer cluster-wide ACID it does need atomic commit of xacts that may contain DML, DDL, or some mix of the two. Let me ask one more question concerning logical replication: how difficult will it be, from your point of view, to support two-phase commit in logical replication? Are there any fundamental problems? I haven't looked closely yet. Andres will know more. I very, very badly want to be able to decode 2PC prepared xacts myself. The main issue I'm aware of is locking - specifically the inability to impersonate another backend and treat locks held by that backend (which might be a fake backend for a pg_prepared_xacts entry)
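To make the table-indirection approach concrete, here is a minimal sketch of the idea, assuming a hypothetical ddl_queue table and capture function (names are illustrative and this is a simplification of what BDR/pglogical actually do): an event trigger records the statement text into an ordinary table, so the DDL travels through logical decoding inside the same transaction as the DML, and the apply side simply executes the queued text in commit order.

-- Hypothetical DDL queue; replicated like any other table.
CREATE TABLE ddl_queue (
    id        bigserial PRIMARY KEY,
    queued_at timestamptz NOT NULL DEFAULT now(),
    ddl_text  text NOT NULL
);

CREATE OR REPLACE FUNCTION capture_ddl() RETURNS event_trigger AS $$
BEGIN
    -- current_query() is a simplification: it captures the whole submitted
    -- statement text, which is good enough for a sketch. A real
    -- implementation must also avoid re-capturing DDL replayed from the
    -- queue on the receiving side.
    INSERT INTO ddl_queue (ddl_text) VALUES (current_query());
END;
$$ LANGUAGE plpgsql;

CREATE EVENT TRIGGER capture_ddl_on_command_end
    ON ddl_command_end
    EXECUTE PROCEDURE capture_ddl();

-- The apply side replays each ddl_queue row with EXECUTE ddl_text, in the
-- same order and transaction boundaries in which the rows arrive through
-- logical decoding.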
Re: [HACKERS] The plan for FDW-based sharding
Once that is done, we can see what workloads it covers and decide if we are willing to copy the volume of code necessary to implement all supported Postgres XC or XL workloads. (The Postgres XL license now matches the Postgres license, http://www.postgres-xl.org/2015/07/license-change-and-9-5-merge/. Postgres XC has always used the Postgres license.) If we are not willing to add code for the missing Postgres XC/XL features, Postgres XC/XL will probably remain a separate fork of Postgres. I don't think anyone knows the answer to this question, and I don't know how to find the answer except to keep going with our current FDW sharding approach. -- Konstantin Knizhnik Postgres Professional: http://www.postgrespro.com The Russian Postgres Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Relation cache invalidation on replica
The reason for the problem is that invalidation messages are not delivered to the replica after the end of a concurrent index build. Invalidation messages are included in the xlog as part of the transaction commit record. A concurrent index build is split into three transactions, the last of which just performs an in-place update of the index tuple, marking it as valid and invalidating the cache. But since this transaction is not assigned an XID, no commit record is written to WAL and sent to the replicas. As a result, the replica doesn't receive these invalidation messages. To fix the problem it is enough to assign an XID to the transaction. It can be done by adding a GetCurrentTransactionId() call to the end of the DefineIndex function:

diff --git a/src/backend/commands/indexcmds.c b/src/backend/commands/indexcmds.c
index 13b04e6..1024603 100644
--- a/src/backend/commands/indexcmds.c
+++ b/src/backend/commands/indexcmds.c
@@ -881,6 +881,12 @@ DefineIndex(Oid relationId,
 	CacheInvalidateRelcacheByRelid(heaprelid.relId);
 
 	/*
+	 * Force WAL commit record to ensure that replica receives invalidation
+	 * messages.
+	 */
+	GetCurrentTransactionId();
+
+	/*
 	 * Last thing to do is release the session-level lock on the parent table.
 	 */
 	UnlockRelationIdForSession(&heaprelid, ShareUpdateExclusiveLock);

On 26.02.2016 15:41, Васильев Дмитрий wrote: A session opened on the replica doesn't see indexes concurrently created on the master at this time. We have master and replica: 1. master: pgbench -i -s 10 2. replica: explain (analyze,verbose) select * from pgbench_accounts where abalance = 1; 3. master: ALTER INDEX pgbench_accounts_abalance_idx RENAME TO pgbench_accounts_abalance_idx_delme; CREATE INDEX CONCURRENTLY pgbench_accounts_abalance_idx ON pgbench_accounts USING btree (abalance); DROP INDEX pgbench_accounts_abalance_idx_delme; 4. at this time on replica: explain (analyze,verbose) select * from pgbench_accounts where abalance = 1;

pgbench=# explain (analyze,verbose) select * from pgbench_accounts where abalance = 1;
                                   QUERY PLAN
--------------------------------------------------------------------------------
 Index Scan using pgbench_accounts_abalance_idx on public.pgbench_accounts  (cost=0.42..4.44 rows=1 width=97) (actual time=655.781..655.781 rows=0 loops=1)
   Output: aid, bid, abalance, filler
   Index Cond: (pgbench_accounts.abalance = 1)
 Planning time: 9388.259 ms
 Execution time: 655.900 ms
(5 rows)

pgbench=# explain (analyze,verbose) select * from pgbench_accounts where abalance = 1;
                                   QUERY PLAN
--------------------------------------------------------------------------------
 Index Scan using pgbench_accounts_abalance_idx_delme on public.pgbench_accounts  (cost=0.42..4.44 rows=1 width=97) (actual time=0.014..0.014 rows=0 loops=1)
   Output: aid, bid, abalance, filler
   Index Cond: (pgbench_accounts.abalance = 1)
 Planning time: 0.321 ms
 Execution time: 0.049 ms
(5 rows)

pgbench=# explain (analyze,verbose) select * from pgbench_accounts where abalance = 1;
                                   QUERY PLAN
--------------------------------------------------------------------------------
 Seq Scan on public.pgbench_accounts  (cost=0.00..28894.00 rows=1 width=97) (actual time=3060.451..3060.451 rows=0 loops=1)
   Output: aid, bid, abalance, filler
   Filter: (pgbench_accounts.abalance = 1)
   Rows Removed by Filter: 100
 Planning time: 0.087 ms
 Execution time: 3060.484 ms
(6 rows)

pgbench=# \d+ pgbench_accounts
                      Table "public.pgbench_accounts"
  Column  |     Type      | Modifiers | Storage  | Stats target | Description
----------+---------------+-----------+----------+--------------+-------------
 aid      | integer       | not null  | plain    |              |
 bid      | integer       |           | plain    |              |
 abalance | integer       |           | plain    |              |
 filler   | character(84) |           | extended |              |
Indexes:
    "pgbench_accounts_pkey" PRIMARY KEY, btree (aid)
    "pgbench_accounts_abalance_idx" btree (abalance)
Options: fillfactor=100

A newly opened session successfully uses this index.
Tested with PostgreSQL 9.5.1. --- Dmitry Vasilyev Postgres Professional: http://www.postgrespro.com Russian Postgres Company -- Konstantin Knizhnik Postgres Professional: http://www.postgrespro.com The Russian Postgres Company
Re: [HACKERS] The plan for FDW-based sharding
We do not have a formal proof that the proposed XTM is "general enough" to handle all possible transaction manager implementations. But there are two general ways of dealing with isolation: snapshot-based and CSN-based. pg_dtm and pg_tsdtm prove that both of them can be implemented using XTM. If you know of some other approach to distributed transaction manager implementation, please let us know. Otherwise the statement "is not general enough" is not concrete enough. The Postgres-XL GTM can in principle be implemented as an extension based on XTM. This API is based on existing PostgreSQL TM functions: we do not introduce any new abstractions. Is it possible that some other TM function has to be encapsulated? Yes, it is. But I do not see much of a problem with adding such a function to XTM in the future if it is actually needed. It happens with most APIs. It is awful when API functions are changed, breaking applications based on this API. But since the functions encapsulated in XTM are in any case present in the PostgreSQL core, I do not think they will be changed in the future unless there are plans to completely rewrite the Postgres transaction manager... Yes, it is certainly possible to develop a cluster by cloning PostgreSQL. But it causes big problems both for developers, who have to constantly synchronize their branch with master, and, more importantly, for customers, who cannot use the standard version of PostgreSQL. It may cause problems with system certification, with running Postgres in the cloud, ... Actually the history of Postgres-XL/XC and Greenplum IMHO shows that this is the wrong direction. On 26.02.2016 19:06, Robert Haas wrote: On Fri, Feb 26, 2016 at 7:21 PM, Oleg Bartunov wrote: Right now the tm is hardcoded and it doesn't matter "if other people might need" at all. We at least provide developers ("other people") the ability to work on their implementations, and the patch is safe and doesn't sacrifice anything in core. I don't believe that. When we install APIs into core, we're committing to keep those APIs around. And I think that we're far too early in the development of transaction managers for PostgreSQL to think that we know what APIs we want to commit to over the long term. And what makes us think we really need multiple transaction managers, anyway? If you are brave enough to say that one tm fits all and you are able to teach the existing tm to play well in various clustering environments during the development period, which is short, then probably we don't need multiple tms. But it's too perfect to believe, and the practical solution is to let multiple groups work on their solutions. Nobody's preventing multiple groups from working on their solutions. That's not the question. The question is why we should install hooks in core at this early stage without waiting to see which implementations prove to be best and whether those hooks are actually general enough to cater to everything people want to do. There is talk of integrating XC/XL work into PostgreSQL; it has a GTM. Postgres Pro has several GTMs. Maybe there will be others. Frankly, I'd like to see a GTM in core at some point because I'd like everybody who uses PostgreSQL to have access to a GTM. What I don't want is for every PostgreSQL company to develop its own GTM and distribute it separately from everybody else's. IIUC, MySQL kinda did that with storage engines and it resulted in the fragmentation of the community. We've had the same thing happen with replication tools - every PostgreSQL company develops their own set.
It would have been better to have ONE set that was distributed by the core project so that we didn't all do the same work over again. I don't understand the argument that without these hooks in core, people can't continue to work on this. It isn't hard to work on GTM without any core changes at all. You just patch your copy of PostgreSQL. We do this all the time, for every patch. We don't add hooks for every patch. dtms. It's time to start working on dtm, I believe. The fact you don't think about distributed transactions support doesn't mean there no "other people", who has different ideas on postgres future. That's why we propose this patch, let's play the game ! I don't like to play games with the architecture of PostgreSQL. -- Konstantin Knizhnik Postgres Professional: http://www.postgrespro.com The Russian Postgres Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] The plan for FDW-based sharding
On 02/26/2016 09:30 PM, Alvaro Herrera wrote: Konstantin Knizhnik wrote: Yes, it is certainly possible to develop a cluster by cloning PostgreSQL. But it causes big problems both for developers, who have to constantly synchronize their branch with master, and, more importantly, for customers, who cannot use the standard version of PostgreSQL. It may cause problems with system certification, with running Postgres in the cloud, ... Actually the history of Postgres-XL/XC and Greenplum IMHO shows that this is the wrong direction. That's not the point, though. I don't think a Postgres clone with a GTM solves any particular problem that's not already solved by the existing forks. However, if you have a clone at home and you make a GTM work on it, then you take the GTM as a patch and post it for discussion. There's no need for hooks for that. Just make sure your GTM solves the problem that it is supposed to solve. Excuse me if I've missed the discussion elsewhere -- why does PostgresPro have *two* GTMs instead of a single one? There are many different clusters which require different approaches to managing distributed transactions. Some clusters do not need distributed transactions at all: if you are executing OLAP queries on a read-only database, a GTM will just add extra overhead. pg_dtm uses a centralized arbiter. It is similar to the Postgres-XL DTM. The presence of a single arbiter significantly simplifies all distributed algorithms: failure detection, global deadlock elimination, ... But at the same time the arbiter is a SPOF and the main factor limiting cluster scalability. pg_tsdtm is based on another approach: it uses system time as the CSN and doesn't require an arbiter. In theory there is no limit to scalability. But differences in system time and the need for more rounds of communication have a negative impact on performance. So there is no ideal solution which works well for all clusters. This is why it is not possible to develop just one GTM, propose it as a patch for review and then (hopefully) commit it to the Postgres core. IMHO it will never happen. And I do not think that it is actually needed. What we need is a way to create our own transaction managers as Postgres extensions without affecting the core. All arguments against XTM can be applied to any other extension API in Postgres, for example FDW. Is it general enough? There are many useful operations which are currently not handled by this API, for example performing aggregation and grouping on the foreign server side. But it is still a very useful and flexible mechanism which allows implementing many wonderful things. From my point of view a good system should be as open and customizable as possible, as long as it doesn't affect performance. Replacing direct function calls with indirect function calls, like adding hooks, in almost all cases does not hurt performance. So without any extra cost we get better flexibility. What's wrong with that? -- Konstantin Knizhnik Postgres Professional: http://www.postgrespro.com The Russian Postgres Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
[HACKERS] Logical decoding restart problems
Hi, We are using logical decoding in multimaster and we are faced with a problem where inconsistent transactions are sent to the replica. Briefly, multimaster uses logical decoding in this way: 1. Each multimaster node is connected with every other node through a logical decoding channel, so each pair of nodes has its own replication slot. 2. In the normal scenario each replication channel is used to replicate only those transactions which originated at the source node. We use the origin mechanism to skip "foreign" transactions. 3. When an offline cluster node is brought back into the multimaster we need to recover this node to the current cluster state. Recovery is performed from one of the cluster's nodes, so we use only one replication channel to receive all (own and foreign) transactions. Only in this case can we guarantee a consistent order of applying transactions at the recovered node. After the end of recovery we need to recreate the replication slots with all other cluster nodes (because we have already replayed the transactions from these nodes). To restart logical decoding we first drop the existing slot, then create a new one, and then start logical replication from WAL position 0/0 (invalid LSN). In this case replay should start from the last consistent point. The problem is that for some reason the consistent point is not so consistent and we get partly decoded transactions. I.e. the transaction body consists of two UPDATEs, but the reorder buffer extracts only one (the last) update and sends this truncated transaction to the destination, causing a consistency violation at the replica. I started investigating the logical decoding code and found several things which I do not understand. Assume that we have transactions T1={start_lsn=100, end_lsn=400} and T2={start_lsn=200, end_lsn=300}. Transaction T2 is sent to the replica and the replica confirms that flush_lsn=300. If we now want to restart logical decoding, we cannot start at a position less than 300, because CreateDecodingContext doesn't allow it:

 * start_lsn
 *		The LSN at which to start decoding.  If InvalidXLogRecPtr, restart
 *		from the slot's confirmed_flush; otherwise, start from the specified
 *		location (but move it forwards to confirmed_flush if it's older than
 *		that, see below).

	else if (start_lsn < slot->data.confirmed_flush)
	{
		/*
		 * It might seem like we should error out in this case, but it's
		 * pretty common for a client to acknowledge a LSN it doesn't have to
		 * do anything for, and thus didn't store persistently, because the
		 * xlog records didn't result in anything relevant for logical
		 * decoding. Clients have to be able to do that to support synchronous
		 * replication.
		 */

So does it mean that we have no chance to restore T1? What is worse, if there are valid T1 transaction records with lsn >= 300, then we can partly decode T1 and send this T1' to the replica. Did I miss something here? Is there any alternative way to "seek" the slot to the proper position without actually fetching data from it or recreating the slot? Is there any mechanism in the xlog which can enforce consistent decoding of a transaction (so that no transaction records are missed)? Maybe I missed something, but I didn't find any "record_number" or anything else which can identify the first record of a transaction. Thanks in advance, Konstantin Knizhnik, Postgres Professional: http://www.postgrespro.com Russian Postgres Company
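As a side note for anyone reproducing this, the slot's two positions can be inspected directly, and pending changes can be peeked at without consuming them. This is only a hedged illustration; the slot name is taken from the follow-up message and may differ in your setup:

-- On the sending node: compare the two positions the discussion is about.
SELECT slot_name, restart_lsn, confirmed_flush_lsn
  FROM pg_replication_slots
 WHERE slot_name = 'mtm_slot_1';

-- Peek at what the slot would decode next without advancing it, to see
-- whether a transaction comes out whole or truncated (only works for an
-- inactive slot, from the database it was created in):
SELECT lsn, xid, data
  FROM pg_logical_slot_peek_changes('mtm_slot_1', NULL, NULL);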
Re: [HACKERS] Logical decoding restart problems
CREATE_REPLICATION_SLOT "mtm_slot_1" LOGICAL "multimaster";
START_REPLICATION SLOT "mtm_slot_1" LOGICAL 0/0 ("startup_params_format" '1', "max_proto_version" '1', "min_proto_version" '1', "forward_changesets" '1', "mtm_replication_mode" 'recovered');
I have also tried to calculate the last origin LSN for this node and explicitly specify it in START_REPLICATION. But it doesn't help: the same problem persists. -- Konstantin Knizhnik Postgres Professional: http://www.postgrespro.com The Russian Postgres Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Logical decoding restart problems
Thank you for answers. > No, you don't need to recreate them. Just advance your replication identifier > downstream and request a replay position in the future. Let the existing slot > skip over unwanted data and resume where you want to start replay. > > You can advance the replication origins on the peers as you replay forwarded > xacts from your master. > > Have a look at how the BDR code does this during "catchup mode" replay. > > So while your problem discussed below seems concerning, you don't have to > drop and recreate slots like are currently doing. The only reason for recreation of slot is that I want to move it to the current "horizont" and skip all pending transaction without explicit specification of the restart position. If I do not drop the slot and just restart replication specifying position 0/0 (invalid LSN), then replication will be continued from the current slot position in WAL, will not it? So there is no way to specify something "start replication from the end of WAL", like lseek(0, SEEK_END). Right now I trying to overcome this limitation by explicit calculation of the position from which we should continue replication. But unfortunately the problem with partly decoded transactions persist. I will try at next week to create example reproducing the problem without any multimaster stuff, just using standard logical decoding plugin. > > To restart logical decoding we first drop existed slot, then create new one > and then start logical replication from the WAL position 0/0 (invalid LSN). > In this case recovery should be started from the last consistent point. > > How do you create the new slot? SQL interface? walsender interface? Direct C > calls? Slot is created by peer node using standard libpq connection with database=replication connection string. > > The problem is that for some reasons consistent point is not so consistent > and we get partly decoded transactions. > I.e. transaction body consists of two UPDATE but reorder_buffer extracts only > the one (last) update and sent this truncated transaction to destination > causing consistency violation at replica. I started investigation of logical > decoding code and found several things which I do not understand. > > Yeah, that sounds concerning and shouldn't happen. I looked at replication code more precisely and understand that my first concerns were wrong. Confirming flush position should not prevent replaying transactions with smaller LSNs. But unfortunately the problem is really present. May be it is caused by race conditions (although most logical decoder data is local to backend). This is why I will try to create reproducing scenario without multimaster. > > Assume that we have transactions T1={start_lsn=100, end_lsn=400} and > T2={start_lsn=200, end_lsn=300}. > Transaction T2 is sent to the replica and replica confirms that flush_lsn=300. > If now we want to restart logical decoding, we can not start with position > less than 300, because CreateDecodingContext doesn't allow it: > > > Right. You've already confirmed receipt of T2, so you can't receive it again. > > So it means that we have no chance to restore T1? > > Wrong. You can, because the slot's restart_lsn still be will be some LSN <= > 100. The slot keeps track of inprogress transactions (using xl_running_xacts > records) and knows it can't discard WAL past lsn 100 because xact T1 is still > in-progress, so it must be able to decode from the start of it. 
> > When you create a decoding context decoding starts at restart_lsn not at > confirmed_flush_lsn. confirmed_flush_lsn is the limit at which commits start > resulting in decoded data being sent to you. So in your case, T1 commits at > lsn=400, which is >300, so you'll receive the whole xact for T1. Yeh, but unfortunately it happens. Need to understand why... > > It's all already there. See logical decoding's use of xl_running_xacts. But how this information is persisted? What will happen if wal_sender is restarted? > > -- > Craig Ringer http://www.2ndQuadrant.com/ > PostgreSQL Development, 24x7 Support, Training & Services
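To make Craig's suggestion concrete, here is a hedged sketch of the alternative to dropping the slot: keep the slot, advance the replication origin on the applying side so already-forwarded transactions are skipped there, and simply ask the walsender to resume, optionally passing a future start position. The origin name and LSN values below are illustrative:

-- On the downstream (applying) node: move the origin forward so xacts
-- already received through the recovery channel are not applied twice.
SELECT pg_replication_origin_advance('mtm_origin_1', '0/1A2B3C40');

-- On the walsender connection: resume from the existing slot; the start
-- position only tells the sender not to ship transactions that commit
-- before that point, and the slot itself is left intact.
START_REPLICATION SLOT "mtm_slot_1" LOGICAL 0/1A2B3C40;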
Re: [HACKERS] UPSERT strange behavior
On 08/25/2016 10:08 PM, Peter Geoghegan wrote: On Thu, Aug 25, 2016 at 11:49 AM, Tom Lane wrote: I think the point is that given the way he's set up the test case, there should be no duplicate violation in the plain unique index unless there is one in the arbiter index. So assuming that INSERT tests the arbiter indexes first, there shouldn't be an error. Maybe it doesn't do that, but it seems like it would be a good idea if it did. Oh, yeah. This is arguably an example of inference failing to infer multiple unique indexes as arbiters. Inference could, in principle, recognize that the second unique index is equivalent to the first, but doesn't. (I don't think that it matters which order anything is tested in, though, because not finding a dup value in the arbiter index does not guarantee that there won't be one in the other index. There is no locking when no conflict is initially found, and so no guarantees here.) Anyway, I don't have a lot of sympathy for this point of view, because the scenario is completely contrived. You have to draw the line somewhere. I do not think that this scenario is completely contrived: cases where a table has more than one unique key are quite common. For example, a "user" may have a unique e-mail address, phone number and login. Also, as far as I know, this is not an artificial example, but a real case taken from a customer's application... I am not sure whether it's really a bug or a feature, but the user's intention was obvious: locate the record by one of the unique keys and, if such a record already exists, increment a counter (do an update instead of an insert). But there are also good arguments why upsert may report a conflict in this case... If such UPSERT behavior is assumed to be correct, what is the best workaround for the problem if we really need to have two separate indexes and want to enforce a unique constraint for both keys? -- Konstantin Knizhnik Postgres Professional: http://www.postgrespro.com The Russian Postgres Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
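A minimal sketch of the situation being discussed (table and column names are purely illustrative): only one unique index can be inferred as the ON CONFLICT arbiter, so a conflict that materializes on the other unique index surfaces as a plain unique_violation instead of taking the DO UPDATE path. One common, if inelegant, workaround is to retry on unique_violation; it is only sensible when conflicts on the two keys are expected to coincide, as in the customer's scenario.

-- Illustrative schema: two independent unique keys, only one arbiter.
CREATE TABLE users (
    login text UNIQUE,
    email text UNIQUE,
    hits  int NOT NULL DEFAULT 1
);

INSERT INTO users (login, email) VALUES ('knizhnik', 'k@example.com');

-- Arbitrates only on login; the duplicate email raises unique_violation
-- instead of turning into the UPDATE:
INSERT INTO users (login, email) VALUES ('teodor', 'k@example.com')
ON CONFLICT (login) DO UPDATE SET hits = users.hits + 1;

-- Hedged workaround sketch: retry the statement on unique_violation.
DO $$
BEGIN
    LOOP
        BEGIN
            INSERT INTO users (login, email) VALUES ('knizhnik', 'k@example.com')
            ON CONFLICT (login) DO UPDATE SET hits = users.hits + 1;
            EXIT;
        EXCEPTION WHEN unique_violation THEN
            NULL;  -- lost a race on the other key; loop and retry
        END;
    END LOOP;
END $$;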
[HACKERS] Handling dropped attributes in pglogical_proto
Hi, pglogical_read_tuple from pglogical_proto.c contains the following code:

	natts = pq_getmsgint(in, 2);
	if (rel->natts != natts)
		elog(ERROR, "tuple natts mismatch, %u vs %u", rel->natts, natts);

But if the table was just altered and some attribute was removed from it, then rel->natts can be greater than natts. The problem can be fixed by the following patch:

--- a/pglogical_proto.c
+++ b/pglogical_proto.c
@@ -263,10 +263,15 @@ pglogical_read_tuple(StringInfo in, PGLogicalRelation *rel,
 	{
 		int			attid = rel->attmap[i];
 		Form_pg_attribute att = desc->attrs[attid];
-		char		kind = pq_getmsgbyte(in);
+		char		kind;
 		const char *data;
 		int			len;
 
+		if (att->atttypid == InvalidOid) {
+			continue;
+		}
+		kind = pq_getmsgbyte(in);
+
 		switch (kind)
 		{
 			case 'n':		/* null */

-- Konstantin Knizhnik Postgres Professional: http://www.postgrespro.com The Russian Postgres Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
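For context on what the patch keys on: after ALTER TABLE ... DROP COLUMN the pg_attribute row for the column is kept, with attisdropped set and atttypid reset to 0 (InvalidOid). A small illustration, using a throwaway table name:

CREATE TABLE t (a int, b int);
ALTER TABLE t DROP COLUMN b;

-- The dropped column is still present in pg_attribute, marked dropped
-- and with atttypid = 0, which is the condition the patch tests for:
SELECT attnum, attname, atttypid, attisdropped
  FROM pg_attribute
 WHERE attrelid = 't'::regclass AND attnum > 0;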
[HACKERS] Issue with logical replication: MyPgXact->xmin already is valid
When creating logical replication slots we quite often get the following error: ERROR: cannot build an initial slot snapshot when MyPgXact->xmin already is valid which cause restart of WAL sender. The comment to this line doesn't clarify much: /* so we don't overwrite the existing value */ if (TransactionIdIsValid(MyPgXact->xmin)) elog(ERROR, "cannot build an initial slot snapshot when MyPgXact->xmin already is valid"); I have checked that MyPgXact->xmin is set by GetSnapshotData: #3 0x0086025e in GetSnapshotData (snapshot=0xf28040 ) at procarray.c:1714 #4 0x00a48523 in GetNonHistoricCatalogSnapshot (relid=2615) at snapmgr.c:479 #5 0x00a484d3 in GetCatalogSnapshot (relid=2615) at snapmgr.c:452 #6 0x004f15bf in systable_beginscan (heapRelation=0x256bdb0, indexId=2684, indexOK=1 '\001', snapshot=0x0, nkeys=1, key=0x7ffdf633c770) at genam.c:353 #7 0x009d676e in SearchCatCache (cache=0x25230a8, v1=39586984, v2=0, v3=0, v4=0) at catcache.c:1169 #8 0x009ec13d in SearchSysCache (cacheId=35, key1=39586984, key2=0, key3=0, key4=0) at syscache.c:1109 #9 0x009ec259 in GetSysCacheOid (cacheId=35, key1=39586984, key2=0, key3=0, key4=0) at syscache.c:1187 #10 0x00574b85 in get_namespace_oid (nspname=0x25c0ca8 "pg_catalog", missing_ok=1 '\001') at namespace.c:3009 #11 0x00574886 in LookupExplicitNamespace (nspname=0x25c0ca8 "pg_catalog", missing_ok=1 '\001') at namespace.c:2871 #12 0x0057437d in get_ts_config_oid (names=0x25c2438, missing_ok=1 '\001') at namespace.c:2653 #13 0x009f56ca in check_TSCurrentConfig (newval=0x7ffdf633cb90, extra=0x7ffdf633cba0, source=PGC_S_FILE) at ts_cache.c:600 #14 0x00a1bbdc in call_string_check_hook (conf=0xf26870 , newval=0x7ffdf633cb90, extra=0x7ffdf633cba0, source=PGC_S_FILE, elevel=12) at guc.c:9912 #15 0x00a14420 in parse_and_validate_value (record=0xf26870 , name=0x25c0620 "default_text_search_config", value=0x25c0658 "pg_catalog.english", source=PGC_S_FILE, elevel=12, newval=0x7ffdf633cb90, newextra=0x7ffdf633cba0) at guc.c:5840 #16 0x00a15543 in set_config_option (name=0x25c0620 "default_text_search_config", value=0x25c0658 "pg_catalog.english", context=PGC_SIGHUP, source=PGC_S_FILE, action=GUC_ACTION_SET, changeVal=1 '\001', elevel=12, is_reload=0 '\000') at guc.c:6442 #17 0x00a1eb5c in ProcessConfigFileInternal (context=PGC_SIGHUP, applySettings=1 '\001', elevel=13) at guc-file.l:439 #18 0x00a1e4ee in ProcessConfigFile (context=PGC_SIGHUP) at guc-file.l:155 #19 0x0082433c in WalSndWaitForWal (loc=25991904) at walsender.c:1309 #20 0x00823423 in logical_read_xlog_page (state=0x25a4f10, targetPagePtr=25985024, reqLen=6880, targetRecPtr=25991880, cur_page=0x25a6c60 "\227\320\005", pageTLI=0x25a57bc) at walsender.c:761 #21 0x00558c3d in ReadPageInternal (state=0x25a4f10, pageptr=25985024, reqLen=6880) at xlogreader.c:556 #22 0x00558405 in XLogReadRecord (state=0x25a4f10, RecPtr=25991880, errormsg=0x7ffdf633cea8) at xlogreader.c:255 #23 0x0080dda6 in DecodingContextFindStartpoint (ctx=0x25a4c50) at logical.c:450 #24 0x00823a0c in CreateReplicationSlot (cmd=0x24dc398) at walsender.c:934 #25 0x008247e4 in exec_replication_command (cmd_string=0x254e8f0 "CREATE_REPLICATION_SLOT \"shardman_copy_t_10_3_4_17307_sync_17302\" TEMPORARY LOGICAL pgoutput USE_SNAPSHOT") at walsender.c:1515 #26 0x0088eccc in PostgresMain (argc=1, argv=0x24f0b28, dbname=0x24f0948 "postgres", username=0x24bf7f0 "knizhnik") at postgres.c:4086 #27 0x007ef9e2 in BackendRun (port=0x24dee00) at postmaster.c:4357 #28 0x007ef10c in BackendStartup (port=0x24dee00) at postmaster.c:4029 #29 0x007eb6cc in 
ServerLoop () at postmaster.c:1753 #30 0x007eacb8 in PostmasterMain (argc=3, argv=0x24bd660) at postmaster.c:1361 #31 0x00728593 in main (argc=3, argv=0x24bd660) at main.c:228 I wonder if it is normal situation or something goes wrong? -- Konstantin Knizhnik Postgres Professional: http://www.postgrespro.com The Russian Postgres Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Issue with logical replication: MyPgXact->xmin already is valid
On 06.10.2017 15:29, Petr Jelinek wrote: On 06/10/17 12:16, Konstantin Knizhnik wrote: When creating logical replication slots we quite often get the following error: ERROR: cannot build an initial slot snapshot when MyPgXact->xmin already is valid which cause restart of WAL sender. The comment to this line doesn't clarify much: /* so we don't overwrite the existing value */ if (TransactionIdIsValid(MyPgXact->xmin)) elog(ERROR, "cannot build an initial slot snapshot when MyPgXact->xmin already is valid"); I wonder if it is normal situation or something goes wrong? Hi, no, it's not a normal situation; it seems you are doing something that assigns an xid before you run the CREATE_REPLICATION_SLOT command on that connection. I am not doing anything in this connection: it is the walsender executing the CREATE_REPLICATION_SLOT replication command. Please look at the stack in my original e-mail. It shows who sets MyPgXact->xmin and when. It is GetSnapshotData, called because of reloading of the configuration settings. And the configuration settings are reloaded because our application is updating "synchronous_standby_names". -- Konstantin Knizhnik Postgres Professional: http://www.postgrespro.com The Russian Postgres Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Issue with logical replication: MyPgXact->xmin already is valid
On 10/07/2017 04:26 PM, Petr Jelinek wrote: Hmm so you start transaction (you have to when running CREATE_REPLICATION_SLOT with USE_SNAPSHOT parameter). And while the slot is being created the config is reloaded. And since now you are in transaction the tsearch hook for GUC processing tries to access catalogs which sets the xmin for the transaction. Actually this slot is implicitly created by LogicalRepSyncTableStart to perform initial data sync. That's not good, but I can't really say I have idea about what to do with it other than to set some kind of flag saying that logical decoding snapshot is being built and using that to skip catalog access which does not seem very pretty. It is not quite clear from the comment why it is not possible to overwrite MyPgXact->xmin or just use existed value. -- Konstantin Knizhnik Postgres Professional: http://www.postgrespro.com The Russian Postgres Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
[HACKERS] Slow synchronous logical replication
In our sharded cluster project we are trying to use logical replication to provide HA (maintaining redundant shard copies). Using asynchronous logical replication does not make much sense in the context of HA. This is why we try to use synchronous logical replication. Unfortunately it shows very bad performance. With 50 shards and redundancy level 1 (just one copy) the cluster is 20 times slower than without logical replication. With asynchronous replication it is "only" two times slower. As far as I understand, the reason for such bad performance is that the synchronous replication mechanism was originally developed for streaming replication, where all replicas have the same content and LSNs. When it is used for logical replication, it behaves very inefficiently. A commit has to wait for confirmations from all receivers mentioned in the "synchronous_standby_names" list. So we are waiting not only for our own single logical replication standby, but for all other standbys as well. The number of synchronous standbys is equal to the number of shards divided by the number of nodes. To provide uniform distribution the number of shards should be much greater than the number of nodes; for example, for 10 nodes we usually create 100 shards. As a result we get awful performance, and blocking of any replication channel blocks all backends. So my question is whether my understanding is correct and synchronous logical replication cannot be used efficiently in such a manner. If so, the next question is how difficult it would be to make the synchronous replication mechanism more efficient for logical replication, and whether there are plans to work in this direction? -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
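As a rough illustration of the setup just described (subscription names and their count are made up; each subscription's walsender reports the subscription name as its application_name, which is what synchronous_standby_names matches against):

-- On the publishing node: make commits wait for every listed subscriber.
ALTER SYSTEM SET synchronous_commit = on;
ALTER SYSTEM SET synchronous_standby_names =
    'FIRST 3 (sub_shard_1, sub_shard_2, sub_shard_3)';
SELECT pg_reload_conf();
-- With FIRST 3 and exactly three names, every commit waits for
-- confirmation from all three logical subscribers.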
Re: [HACKERS] Slow synchronous logical replication
On 10/07/2017 10:42 PM, Andres Freund wrote: Hi, On 2017-10-07 22:39:09 +0300, konstantin knizhnik wrote: In our sharded cluster project we are trying to use logical relication for providing HA (maintaining redundant shard copies). Using asynchronous logical replication has not so much sense in context of HA. This is why we try to use synchronous logical replication. Unfortunately it shows very bad performance. With 50 shards and level of redundancy=1 (just one copy) cluster is 20 times slower then without logical replication. With asynchronous replication it is "only" two times slower. As far as I understand, the reason of such bad performance is that synchronous replication mechanism was originally developed for streaming replication, when all replicas have the same content and LSNs. When it is used for logical replication, it behaves very inefficiently. Commit has to wait confirmations from all receivers mentioned in "synchronous_standby_names" list. So we are waiting not only for our own single logical replication standby, but all other standbys as well. Number of synchronous standbyes is equal to number of shards divided by number of nodes. To provide uniform distribution number of shards should >> than number of nodes, for example for 10 nodes we usually create 100 shards. As a result we get awful performance and blocking of any replication channel blocks all backends. So my question is whether my understanding is correct and synchronous logical replication can not be efficiently used in such manner. If so, the next question is how difficult it will be to make synchronous replication mechanism for logical replication more efficient and are there some plans to work in this direction? This seems to be a question that is a) about a commercial project we don't know much about b) hasn't received a lot of investigation. Sorry, If I was not clear. The question was about logical replication mechanism in mainstream version of Postgres. I think that most of people are using asynchronous logical replication and synchronous LR is something exotic and not well tested and investigated. It will be great if I am wrong:) Concerning our sharded cluster (pg_shardman) - it is not a commercial product yet, it is in development phase. We are going to open its sources when it will be more or less stable. But unlike multimaster, this sharded cluster is mostly built from existed components: pg_pathman + postgres_fdw + logical replication. So we are just trying to combine them all into some integrated system. But currently the most obscure point is logical replication. And the main goal of my e-mail was to know the opinion of authors and users of LR whether it is good idea to use LR to provide fault tolerance in sharded cluster. Or some other approaches, for example sharding with redundancy or using streaming replication are preferable? -- Konstantin Knizhnik Postgres Professional: http://www.postgrespro.com The Russian Postgres Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Slow synchronous logical replication
Thank you for the explanations. On 08.10.2017 16:00, Craig Ringer wrote: I think it'd be helpful if you provided reproduction instructions, test programs, etc, making it very clear when things are / aren't related to your changes. It will not be so easy to provide a reproducing scenario, because it involves many components (postgres_fdw, pg_pathman, pg_shardman, LR, ...) and requires a multi-node installation. But let me try to explain what is going on: So we have implemented sharding - splitting data between several remote tables using pg_pathman and postgres_fdw. It means that an insert or update of the parent table causes an insert or update of some derived partition, which is forwarded by postgres_fdw to the corresponding node. The number of shards is significantly larger than the number of nodes, i.e. for 5 nodes we have 50 shards, which means that on each node we have 10 shards. To provide fault tolerance each shard is replicated using logical replication to one or more nodes. Right now we consider only redundancy level 1 - each shard has only one replica. So from each node we establish 10 logical replication channels. We want commit to wait until the data is actually stored at all replicas, so we are using synchronous replication: we set the synchronous_commit option to "on" and include all ten subscriptions in the synchronous_standby_names list. In this setup the commit latency is very large (about 100 msec, and most of the time is actually spent in commit) and performance is very bad - pgbench shows about 300 TPS for the optimal number of clients (about 10; for larger numbers performance is almost the same). Without logical replication the same setup gives about 6000 TPS. I have checked the syncrepl.c file, particularly the SyncRepGetSyncRecPtr function. Each walsender independently calculates the minimal LSN among all synchronous replicas and wakes up the backends waiting for this LSN. It means that a transaction updating data in one shard will actually wait for confirmation from the replication channels of all shards. If some shard is updated more rarely than others or is not updated at all (for example because the communication channel to its node is broken), then all backends will get stuck. Also, all backends compete for the single SyncRepLock, which can also be a contention point. -- Konstantin Knizhnik Postgres Professional: http://www.postgrespro.com The Russian Postgres Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Columnar storage support
Unfortunately C-Store doesn't let you take full advantage of a columnar store: you are still not able to perform vector operations. C-Store reduces the amount of data read from disk because of 1. fetching only the columns which are used in the query, and 2. data compression. This leads to some benefit in query execution speed for cold data (when it is not loaded in the cache). For warm data there is almost no difference (except for very huge tables which cannot fit in memory). But the main advantage of a vertical data format - vector data processing - is possible only with a specialized executor. There is a prototype of a vector executor for C-Store: https://github.com/citusdata/postgres_vectorization_test It provides a 3-4x speedup on some queries, but it is really a prototype and research project, far from practical usage. I have also developed two columnar store extensions for Postgres: IMCS (In-Memory-Columnar-Store): https://github.com/knizhnik/imcs.git VOPS (Vectorized Operations): https://github.com/postgrespro/vops.git The first one is more oriented towards in-memory databases (although it supports spilling data to disk) and requires using special functions to manipulate the columnar data. In this case the columnar store is a copy of the main (horizontal) store (normal Postgres tables). VOPS is more recent work which allows using more or less normal SQL (via a foreign data wrapper and user-defined types/operators). In VOPS data is stored inside normal Postgres tables, but using vectors (tiles) instead of scalars. Both IMCS and VOPS provide a 10-100 times speed improvement on queries like Q1 in TPC-H (sequential scan with filtering and aggregation). For queries involving joins there is almost no benefit compared with normal Postgres. There is also a columnar storage extension developed by Fujitsu: https://www.postgresql.org/message-id/cajrrpgfac7wc9nk6ptty6yn-nn+hcy8xolah2doyhvg5d6h...@mail.gmail.com But the published patch is only a first step in this direction, and it is possible neither to use it in practice nor to perform experiments measuring the possible performance improvement. On 09.10.2017 23:06, Joshua D. Drake wrote: On 10/09/2017 01:03 PM, legrand legrand wrote: Is there a chance that pluggable storage permits creation of a columnar RDBMS like MonetDB in PostgreSQL? Thanks in advance for the answer The extension C-Store from Citus is probably what you are looking for. jD -- Sent from: http://www.postgresql-archive.org/PostgreSQL-hackers-f1928748.html -- Konstantin Knizhnik Postgres Professional: http://www.postgrespro.com The Russian Postgres Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Slow synchronous logical replication
On 11.10.2017 10:07, Craig Ringer wrote: On 9 October 2017 at 15:37, Konstantin Knizhnik wrote: Thank you for the explanations. On 08.10.2017 16:00, Craig Ringer wrote: I think it'd be helpful if you provided reproduction instructions, test programs, etc, making it very clear when things are / aren't related to your changes. It will not be so easy to provide a reproducing scenario, because it involves many components (postgres_fdw, pg_pathman, pg_shardman, LR, ...) So simplify it to a test case that doesn't. The simplest reproducing scenario is the following (a condensed SQL version of these steps is sketched below):
1. Start two Postgres instances: synchronous_commit=on, fsync=off.
2. Initialize the pgbench database on both instances: pgbench -i
3. Create a publication for the pgbench_accounts table on one node.
4. Create the corresponding subscription on the other node with the copy_data=false parameter.
5. Add the subscription to synchronous_standby_names on the first node.
6. Start pgbench -c 8 -N -T 100 -P 1 on the first node.
On my systems the results are the following:
standalone postgres: 8600 TPS
asynchronous replication: 6600 TPS
synchronous replication: 5600 TPS
Quite good results.
7. Create some dummy table and perform a bulk insert into it: create table dummy(x integer primary key); insert into dummy values (generate_series(1,1000));
pgbench almost stalls: until the end of the insert, performance drops almost to zero. The reason for this behavior is obvious: the walsender has to decode the huge transaction generated by the insert although it has no relation to this publication. Filtering of the insert records of this transaction is done only inside the output plugin. Unfortunately it is not quite clear how to make the walsender smarter and let it skip transactions not affecting its publication. One possible solution is to let the backend inform the walsender about the smallest LSN it should wait for (the backend knows which table is affected by the current operation, and hence which publications are interested in it, and so can point the walsender to the proper LSN without decoding a huge part of WAL). But it seems to be not so easy to implement. -- Konstantin Knizhnik Postgres Professional: http://www.postgrespro.com The Russian Postgres Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
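A condensed, hedged version of the steps above, with host names illustrative, the pgbench invocations left as comments, and the row count in the final insert likewise only an example:

-- pgbench -i   (run against both instances first)

-- On the publishing instance:
CREATE PUBLICATION pgbench_pub FOR TABLE pgbench_accounts;
ALTER SYSTEM SET synchronous_standby_names = 'pgbench_sub';
SELECT pg_reload_conf();

-- On the subscribing instance:
CREATE SUBSCRIPTION pgbench_sub
    CONNECTION 'host=publisher dbname=postgres'
    PUBLICATION pgbench_pub
    WITH (copy_data = false);

-- pgbench -c 8 -N -T 100 -P 1   (against the publisher)

-- While pgbench is running, on the publisher:
CREATE TABLE dummy(x integer PRIMARY KEY);
INSERT INTO dummy VALUES (generate_series(1, 10000000));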
Re: [HACKERS] Slow synchronous logical replication
On 12.10.2017 04:23, Craig Ringer wrote: On 12 October 2017 at 00:57, Konstantin Knizhnik wrote: The reason for this behavior is obvious: the walsender has to decode the huge transaction generated by the insert although it has no relation to this publication. It does. Though I wouldn't expect anywhere near the kind of drop you report, and haven't observed it here. Is the CREATE TABLE and INSERT done in the same transaction? No. The table was created in a separate transaction. Moreover, the same effect takes place if the table is created before the start of replication. The problem in this case seems to be caused by spilling the decoded transaction to a file in ReorderBufferSerializeTXN. Please look at two profiles: http://garret.ru/lr1.svg corresponds to normal operation of pgbench with synchronous replication to one replica, http://garret.ru/lr2.svg - the same with concurrent execution of the huge insert statement. And here is the output of pgbench (at the fifth second the insert is started):
progress: 1.0 s, 10020.9 tps, lat 0.791 ms stddev 0.232
progress: 2.0 s, 10184.1 tps, lat 0.786 ms stddev 0.192
progress: 3.0 s, 10058.8 tps, lat 0.795 ms stddev 0.301
progress: 4.0 s, 10230.3 tps, lat 0.782 ms stddev 0.194
progress: 5.0 s, 10335.0 tps, lat 0.774 ms stddev 0.192
progress: 6.0 s, 4535.7 tps, lat 1.591 ms stddev 9.370
progress: 7.0 s, 419.6 tps, lat 20.897 ms stddev 55.338
progress: 8.0 s, 105.1 tps, lat 56.140 ms stddev 76.309
progress: 9.0 s, 9.0 tps, lat 504.104 ms stddev 52.964
progress: 10.0 s, 14.0 tps, lat 797.535 ms stddev 156.082
progress: 11.0 s, 14.0 tps, lat 601.865 ms stddev 93.598
progress: 12.0 s, 11.0 tps, lat 658.276 ms stddev 138.503
progress: 13.0 s, 9.0 tps, lat 784.120 ms stddev 127.206
progress: 14.0 s, 7.0 tps, lat 870.944 ms stddev 156.377
progress: 15.0 s, 8.0 tps, lat .578 ms stddev 140.987
progress: 16.0 s, 7.0 tps, lat 1258.750 ms stddev 75.677
progress: 17.0 s, 6.0 tps, lat 991.023 ms stddev 229.058
progress: 18.0 s, 5.0 tps, lat 1063.986 ms stddev 269.361
It seems to be an effect of large transactions. The presence of several channels of synchronous logical replication reduces performance, but not so much. Below are results on another machine with pgbench scale 10.
Configuration              TPS
standalone                 15k
1 async logical replica    13k
1 sync logical replica     10k
3 async logical replicas   13k
3 sync logical replicas     8k
Only partly true. The output plugin can register a transaction origin filter and use that to say it's entirely uninterested in a transaction. But this only works based on filtering by origins. Not tables. Yes, I know about the origin filtering mechanism (and we are using it in multimaster). But I am speaking about the standard pgoutput.c output plugin: its pgoutput_origin_filter always returns false. I imagine we could call another hook in output plugins, "do you care about this table", and use it to skip some more work for tuples that particular decoding session isn't interested in. Skip adding them to the reorder buffer, etc. No such hook currently exists, but it'd be an interesting patch for Pg11 if you feel like working on it. Unfortunately it is not quite clear how to make the walsender smarter and let it skip transactions not affecting its publication. As noted, it already can do so by origin. Mostly. We cannot totally skip over WAL, since we need to process various invalidations etc. See ReorderBufferSkip. The problem is that before the end of the transaction we do not know whether it touches this publication or not. So filtering by origin will not work in this case. I am really not sure that it is possible to skip over WAL.
But the particular problem with invalidation records etc. can be solved by always processing these records in the walsender. I.e. if a backend inserts an invalidation record or some other record which should always be processed by the walsender, it can always promote the LSN of this record to the walsender. So the walsender will skip only those WAL records which are safe to skip (insert/update/delete records not affecting this publication). I wonder if there can be some other problems with skipping part of a transaction in the walsender. -- Konstantin Knizhnik Postgres Professional: http://www.postgrespro.com The Russian Postgres Company
[HACKERS] Deadlock in ALTER SUBSCRIPTION REFRESH PUBLICATION
ty_hook (first_arg=0x24dd998, queryString=0x24dc920 "SET SESSION synchronous_commit TO local; ALTER SUBSCRIPTION sub_3_1 REFRESH PUBLICATION", context=PROCESS_UTILITY_QUERY, params=0x0, queryEnv=0x0, dest=0x24ddcb8, completionTag=0x73a43c90 "") at src/hooks.c:913 #14 0x008a80a2 in ProcessUtility (pstmt=0x24dd998, queryString=0x24dc920 "SET SESSION synchronous_commit TO local; ALTER SUBSCRIPTION sub_3_1 REFRESH PUBLICATION", context=PROCESS_UTILITY_QUERY, params=0x0, queryEnv=0x0, dest=0x24ddcb8, completionTag=0x73a43c90 "") at utility.c:353 #15 0x008a7075 in PortalRunUtility (portal=0x246be60, pstmt=0x24dd998, isTopLevel=0 '\000', setHoldSnapshot=0 '\000', dest=0x24ddcb8, completionTag=0x73a43c90 "") at pquery.c:1178 #16 0x008a728d in PortalRunMulti (portal=0x246be60, isTopLevel=0 '\000', setHoldSnapshot=0 '\000', dest=0x24ddcb8, altdest=0x24ddcb8, completionTag=0x73a43c90 "") at pquery.c:1324 #17 0x008a6757 in PortalRun (portal=0x246be60, count=9223372036854775807, isTopLevel=0 '\000', run_once=1 '\001', dest=0x24ddcb8, altdest=0x24ddcb8, completionTag=0x73a43c90 "") at pquery.c:799 #18 0x008a0288 in exec_simple_query (query_string=0x24dc920 "SET SESSION synchronous_commit TO local; ALTER SUBSCRIPTION sub_3_1 REFRESH PUBLICATION") at postgres.c:1099 #19 0x008a4823 in PostgresMain (argc=1, argv=0x247c2d0, dbname=0x247c2a8 "postgres", username=0x244f870 "knizhnik") at postgres.c:4090 #20 0x00801753 in BackendRun (port=0x240) at postmaster.c:4357 #21 0x00800e5f in BackendStartup (port=0x240) at postmaster.c:4029 #22 0x007fd398 in ServerLoop () at postmaster.c:1753 #23 0x007fc92f in PostmasterMain (argc=3, argv=0x244d6e0) at postmaster.c:1361 #24 0x00734f08 in main (argc=3, argv=0x244d6e0) at main.c:228 The reason of this deadlock seems to be clear: ALTER SUBSCRIPTION starts transaction at one node and tries to create slot at other node, which waiting for completion of all active transaction while building scnapshpot. Is there any way to avoid this deadlock? The same deadlock can happen in concurrent CREATE SUBSCRIPTION with implicit slot creation. But here it can be sovled by explicit slot creation prior to CREATE SUBSCRIPTION. But what can I do if I want to update publications and refresh subscriptions? Thanks in advance, -- Konstantin Knizhnik Postgres Professional: http://www.postgrespro.com The Russian Postgres Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
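A sketch of the explicit-slot-creation workaround mentioned above for the CREATE SUBSCRIPTION case, with node and publication names purely illustrative; it does not help with ALTER SUBSCRIPTION ... REFRESH PUBLICATION, which is exactly the open question here:

-- On the publishing node, outside of any surrounding transaction:
SELECT pg_create_logical_replication_slot('sub_3_1', 'pgoutput');

-- On the subscribing node, reuse the pre-created slot instead of letting
-- CREATE SUBSCRIPTION create one inside its own transaction:
CREATE SUBSCRIPTION sub_3_1
    CONNECTION 'host=node3 dbname=postgres'
    PUBLICATION pub_3
    WITH (create_slot = false, slot_name = 'sub_3_1', copy_data = false);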
Re: [HACKERS] Secondary index access optimizations
On 11/06/2017 04:27 AM, Thomas Munro wrote: On Fri, Sep 8, 2017 at 3:58 AM, Konstantin Knizhnik wrote: Updated version of the patch is attached to this mail. Also I added support of date type to operator_predicate_proof to be able to imply (logdate <= '2017-03-31') from (logdate < '2017-04-01') . Hi Konstantin, Is there any reason why you don't want to split this into two separate proposals? One for remove_restrictions_implied_by_constraints() and one for the operator_predicate_proof() changes. Your v3 patch breaks the new partition_join test (the recently committed partition-wise join stuff), as far as I can tell in a good way. Can you please double check those changes and post an updated patch? Hi Thomas. The primary idea of this patch was to provide more efficient plans for queries on partitioned tables. So remove_restrictions_implied_by_constraints() removes redundant predicate checks. But it doesn't work for standard Postgres 10 partitioning, because here constraints are set using intervals with open high boundary and original version of operator_predicate_proof() is not able to handle this case. I have explained this problem in my previous e-mails in this thread. This is why I have changed operator_predicate_proof() to correctly handle this case. If you think this patch should be splitted into two, ok: I can do it. I just want to notice that without patching operator_predicate_proof() it may give not positive effect for standard partitioning, which I expect to be the most popular use case where this optimization may have an effect. Concerning broken partition_join test: it is "expected" failure: my patch removes from the plans redundant checks. So the only required action is to update expected file with results. Attached please find updated patch. 
-- Konstantin Knizhnik Postgres Professional: http://www.postgrespro.com The Russian Postgres Company diff --git a/contrib/postgres_fdw/expected/postgres_fdw.out b/contrib/postgres_fdw/expected/postgres_fdw.out index 4339bbf..0931af1 100644 --- a/contrib/postgres_fdw/expected/postgres_fdw.out +++ b/contrib/postgres_fdw/expected/postgres_fdw.out @@ -626,12 +626,12 @@ EXPLAIN (VERBOSE, COSTS OFF) SELECT * FROM ft1 t1 WHERE c1 IS NULL;-- Nu Remote SQL: SELECT "C 1", c2, c3, c4, c5, c6, c7, c8 FROM "S 1"."T 1" WHERE (("C 1" IS NULL)) (3 rows) -EXPLAIN (VERBOSE, COSTS OFF) SELECT * FROM ft1 t1 WHERE c1 IS NOT NULL;-- NullTest - QUERY PLAN -- +EXPLAIN (VERBOSE, COSTS OFF) SELECT * FROM ft1 t1 WHERE c1 IS NOT NULL and c3 is not null;-- NullTest +QUERY PLAN +-- Foreign Scan on public.ft1 t1 Output: c1, c2, c3, c4, c5, c6, c7, c8 - Remote SQL: SELECT "C 1", c2, c3, c4, c5, c6, c7, c8 FROM "S 1"."T 1" WHERE (("C 1" IS NOT NULL)) + Remote SQL: SELECT "C 1", c2, c3, c4, c5, c6, c7, c8 FROM "S 1"."T 1" WHERE ((c3 IS NOT NULL)) (3 rows) EXPLAIN (VERBOSE, COSTS OFF) SELECT * FROM ft1 t1 WHERE round(abs(c1), 0) = 1; -- FuncExpr diff --git a/contrib/postgres_fdw/sql/postgres_fdw.sql b/contrib/postgres_fdw/sql/postgres_fdw.sql index ddfec79..878bfc7 100644 --- a/contrib/postgres_fdw/sql/postgres_fdw.sql +++ b/contrib/postgres_fdw/sql/postgres_fdw.sql @@ -292,7 +292,7 @@ RESET enable_nestloop; EXPLAIN (VERBOSE, COSTS OFF) SELECT * FROM ft1 t1 WHERE t1.c1 = 1; -- Var, OpExpr(b), Const EXPLAIN (VERBOSE, COSTS OFF) SELECT * FROM ft1 t1 WHERE t1.c1 = 100 AND t1.c2 = 0; -- BoolExpr EXPLAIN (VERBOSE, COSTS OFF) SELECT * FROM ft1 t1 WHERE c1 IS NULL;-- NullTest -EXPLAIN (VERBOSE, COSTS OFF) SELECT * FROM ft1 t1 WHERE c1 IS NOT NULL;-- NullTest +EXPLAIN (VERBOSE, COSTS OFF) SELECT * FROM ft1 t1 WHERE c1 IS NOT NULL and c3 is not null;-- NullTest EXPLAIN (VERBOSE, COSTS OFF) SELECT * FROM ft1 t1 WHERE round(abs(c1), 0) = 1; -- FuncExpr EXPLAIN (VERBOSE, COSTS OFF) SELECT * FROM ft1 t1 WHERE c1 = -c1; -- OpExpr(l) EXPLAIN (VERBOSE, COSTS OFF) SELECT * FROM ft1 t1 WHERE 1 = c1!; -- OpExpr(r) diff --git a/src/backend/optimizer/path/allpaths.c b/src/backend/optimizer/path/allpaths.c index a5c1b68..082d1cc 100644 --- a/src/backend/optimizer/path/allpaths.c +++ b/src/backend/optimizer/path/allpaths.c @@ -346,6 +346,7 @@ set_rel_size(PlannerInfo *root, RelOptInfo *rel, switch (rel->rtekind) { case RTE_RELATION: +remove_restrictions_implied_by_constraints(root, rel, rte); if (rte->relkind == RELKIND_FOREIGN_TABLE)
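To see the kind of query the patch is aimed at, here is an illustrative example (not taken from the patch's regression tests) with standard PostgreSQL 10 range partitioning, where the partition constraint uses a half-open interval and the benefit depends on recognizing that logdate < '2017-04-01' is equivalent to logdate <= '2017-03-31' for a date column:

CREATE TABLE log (id int, logdate date) PARTITION BY RANGE (logdate);
CREATE TABLE log_2017_03 PARTITION OF log
    FOR VALUES FROM ('2017-03-01') TO ('2017-04-01');

-- With remove_restrictions_implied_by_constraints(), the scan of
-- log_2017_03 no longer needs to re-check these predicates, because the
-- partition constraint already implies them:
EXPLAIN SELECT * FROM log
 WHERE logdate >= '2017-03-01' AND logdate < '2017-04-01';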
Re: [HACKERS] SQL procedures
On 08.11.2017 17:23, Merlin Moncure wrote: On Tue, Oct 31, 2017 at 12:23 PM, Peter Eisentraut wrote: - Transaction control in procedure bodies This feature is really key, since it enables via SQL lots of things that are not possible without external coding, including:
*) very long running processes in a single routine
*) transaction isolation control inside the procedure (currently the client app has to declare this)
*) certain error handling cases that require client side support
*) simple in-database threading
*) simple construction of daemon scripts (yeah, you can use bgworker for this, but a pure sql daemon with a cron heartbeat hook is hard to beat for simplicity)
I do wonder how transaction control could be added later. The last time I (lightly) looked at this, I was starting to think that working transaction control into the SPI interface was the wrong approach; pl/pgsql would have to adopt a very different set of behaviors if it was called in a function or a proc. If you restricted language choice to purely SQL, you could work around this problem; SPI languages would be totally abstracted from those sets of considerations and you could always call an arbitrary language function if you needed to. SQL has no flow control but I'm not too concerned about that. merlin I am also very interested in the answer to this question: how are you going to implement transaction control inside a procedure? Right now PostgresPro EE supports autonomous transactions. They are supported both for SQL and for the plpgsql/plpython APIs. They are implemented by saving/restoring the transaction context, so unlike most other ATX implementations, in pgpro an autonomous transaction is executed by the same backend. But it is not so easy to do: in Postgres almost every module has its own static variables which keep transaction-specific data. So we have to provide a dozen suspend/resume functions: SuspendSnapshot(), SuspendPredicate(), SuspendStorage(), SuspendInvalidationInfo(), SuspendPgXact(), PgStatSuspend(), TriggerSuspend(), SuspendSPI()... and properly handle local cache invalidation. The patch consists of more than 5 thousand lines. So my question is whether you are going to implement something similar or use a completely different approach? In the first case it would be good to somehow unite our efforts... For example, we can publish our ATX patch for Postgres 10. We have not done it yet, because there seems to be no chance of pushing this patch to the community. -- Konstantin Knizhnik Postgres Professional: http://www.postgrespro.com The Russian Postgres Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
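For readers unfamiliar with the proposal under discussion, the intended usage looks roughly like the following sketch (syntax along the lines of the SQL-procedures patch series; the table and column names are invented, and in-procedure COMMIT was not available in any released version at the time of this thread):

CREATE PROCEDURE batch_cleanup() LANGUAGE plpgsql AS $$
BEGIN
    LOOP
        -- delete in limited batches so each batch commits separately
        DELETE FROM events
         WHERE id IN (SELECT id FROM events WHERE done LIMIT 10000);
        EXIT WHEN NOT FOUND;
        COMMIT;  -- ends the current transaction inside the procedure;
                 -- a new one is started automatically for the next batch
    END LOOP;
END;
$$;

CALL batch_cleanup();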
[HACKERS] Aggregates push-down to partitions
There is a huge thread concerning pushing down aggregates to FDWs: https://www.postgresql.org/message-id/flat/CAFjFpRcnueviDpngJ3QSVvj7oyukr9NkSiCspqd4N%2BdCEdvYvg%40mail.gmail.com#cafjfprcnuevidpngj3qsvvj7oyukr9nksicspqd4n+dcedv...@mail.gmail.com but as far as I understand, nothing has been done yet for efficient calculation of aggregates over a partitioned table. In the case of local partitions this is partly compensated by a parallel query plan:

postgres=# create table base(x integer);
CREATE TABLE
postgres=# create table derived1() inherits (base);
CREATE TABLE
postgres=# create table derived2() inherits (base);
CREATE TABLE
postgres=# insert into derived1 values (generate_series(1,1000000));
INSERT 0 1000000
postgres=# insert into derived2 values (generate_series(1,1000000));
INSERT 0 1000000
postgres=# explain select sum(x) from base;
                                     QUERY PLAN
-------------------------------------------------------------------------------------
 Finalize Aggregate  (cost=12176.63..12176.64 rows=1 width=8)
   ->  Gather  (cost=12176.59..12176.61 rows=8 width=8)
         Workers Planned: 8
         ->  Partial Aggregate  (cost=12175.59..12175.60 rows=1 width=8)
               ->  Append  (cost=0.00..11510.47 rows=266048 width=4)
                     ->  Parallel Seq Scan on base  (cost=0.00..0.00 rows=1 width=4)
                     ->  Parallel Seq Scan on derived1  (cost=0.00..5675.00 rows=125000 width=4)
                     ->  Parallel Seq Scan on derived2  (cost=0.00..5835.47 rows=141047 width=4)
(8 rows)

This is still far from the ideal plan, because each worker works with all partitions instead of splitting the partitions between the workers and calculating a partial aggregate per partition. But if we add an FDW child to the parent table, then a parallel scan cannot be used at all and we get the worst possible plan:

postgres=# create foreign table derived_fdw() inherits(base) server pg_fdw options (table_name 'derived1');
CREATE FOREIGN TABLE
postgres=# explain select sum(x) from base;
                                  QUERY PLAN
------------------------------------------------------------------------------
 Aggregate  (cost=34055.07..34055.08 rows=1 width=8)
   ->  Append  (cost=0.00..29047.75 rows=2002926 width=4)
         ->  Seq Scan on base  (cost=0.00..0.00 rows=1 width=4)
         ->  Seq Scan on derived1  (cost=0.00..14425.00 rows=1000000 width=4)
         ->  Seq Scan on derived2  (cost=0.00..14425.00 rows=1000000 width=4)
         ->  Foreign Scan on derived_fdw  (cost=100.00..197.75 rows=2925 width=4)
(6 rows)

So we sequentially pull all the data to this node and compute the aggregate locally. The ideal plan would calculate partial aggregates at all nodes in parallel and then combine the partial results. That requires two changes:

1. Replace Aggregate->Append with Finalize_Aggregate->Append->Partial_Aggregate.
2. Concurrent execution of Append.

The second item can be done in two different ways: we can try to use the existing parallel-worker infrastructure and replace Append with Gather, which seems to be the best approach for local partitioning. In the case of remote (FDW) partitions it is enough to separate starting the execution (PQsendQuery in postgres_fdw) from getting the results, which requires some changes in the FDW protocol.

I wonder whether somebody has already investigated this problem or is working in this direction. Maybe there are already some patches proposed? I have searched the hackers archive but didn't find anything relevant... Are there any suggestions about the best approach to implement this feature?

-- Konstantin Knizhnik Postgres Professional: http://www.postgrespro.com The Russian Postgres Company
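For illustration, with a postgres_fdw version that supports aggregate push-down the desired plan shape can be forced by hand today only by aggregating each child separately and combining the per-child results locally; a sketch using the tables from the example above (ONLY keeps the parent from re-scanning its children):

SELECT sum(s) AS total
FROM (
    SELECT sum(x) AS s FROM ONLY base
    UNION ALL
    SELECT sum(x) AS s FROM derived1
    UNION ALL
    SELECT sum(x) AS s FROM derived2
    UNION ALL
    SELECT sum(x) AS s FROM derived_fdw
) AS partial_sums;

This is essentially what a Finalize_Aggregate->Append->Partial_Aggregate plan would produce automatically.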
Re: [HACKERS] Aggregates push-down to partitions
On 10.11.2017 12:15, Ashutosh Bapat wrote:
Maybe in this thread[1] your described problem is solved through introducing a Parallel Append node? 1. https://www.postgresql.org/message-id/CAJ3gD9dy0K_E8r727heqXoBmWZ83HwLFwdcaSSmBQ1%2BS%2BvRuUQ%40mail.gmail.com
You may want to review [2] and [3] as well.
[2] https://www.postgresql.org/message-id/9666.1491295317@localhost
[3] https://www.postgresql.org/message-id/CAM2+6=V64_xhstVHie0Rz=kpeqnljmzt_e314p0jat_oj9m...@mail.gmail.com

Thank you very much for these references. I applied the partition-wise-agg-v6 patches, and for partitioned tables it works perfectly:

shard=# explain select count(*) from orders;
                                  QUERY PLAN
------------------------------------------------------------------------------
 Finalize Aggregate  (cost=100415.29..100415.30 rows=1 width=8)
   ->  Append  (cost=50207.63..100415.29 rows=2 width=8)
         ->  Partial Aggregate  (cost=50207.63..50207.64 rows=1 width=8)
               ->  Foreign Scan on orders_0  (cost=101.00..50195.13 rows=5000 width=0)
         ->  Partial Aggregate  (cost=50207.63..50207.64 rows=1 width=8)
               ->  Foreign Scan on orders_1  (cost=101.00..50195.13 rows=5000 width=0)
(6 rows)

But I wonder why the same optimization is not applied to a normal inherited table:

shard=# explain select count(*) from base;
                                  QUERY PLAN
------------------------------------------------------------------------------
 Aggregate  (cost=44087.99..44088.00 rows=1 width=8)
   ->  Append  (cost=0.00..39079.46 rows=2003414 width=0)
         ->  Seq Scan on base  (cost=0.00..0.00 rows=1 width=0)
         ->  Seq Scan on derived1  (cost=0.00..14425.00 rows=1000000 width=0)
         ->  Seq Scan on derived2  (cost=0.00..14425.00 rows=1000000 width=0)
         ->  Foreign Scan on derived_fdw  (cost=100.00..212.39 rows=3413 width=0)
(6 rows)

Are there some fundamental problems with that?

-- Konstantin Knizhnik Postgres Professional: http://www.postgrespro.com The Russian Postgres Company
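For comparison, the declaratively partitioned form of the same layout is shown below (a sketch with made-up names and bounds). Presumably the partition-wise aggregation patch keys on the declarative partitioning metadata, which a plain inheritance hierarchy does not carry, and that would explain the difference; this is a guess, not a statement about the patch internals.

CREATE TABLE base_part (x integer) PARTITION BY RANGE (x);
CREATE TABLE base_part_1 PARTITION OF base_part FOR VALUES FROM (1) TO (1000001);
CREATE TABLE base_part_2 PARTITION OF base_part FOR VALUES FROM (1000001) TO (2000001);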
Re: [HACKERS] Partition-wise aggregation/grouping
On 10/27/2017 02:01 PM, Jeevan Chalke wrote:
Hi, Attached new patch-set here. Changes include:
1. Added separate patch for costing Append node as discussed up-front in the patch-set.
2. Since we now cost Append node, we don't need partition_wise_agg_cost_factor GUC. So removed that. The remaining patch hence merged into main implementation patch.
3. Updated rows in test-cases so that we will get partition-wise plans.
Thanks

I applied the partition-wise-agg-v6.tar.gz patch to master and used the shard.sh example from https://www.postgresql.org/message-id/14577.1509723225%40localhost The plan for count(*) is the following:

shard=# explain select count(*) from orders;
                                  QUERY PLAN
------------------------------------------------------------------------------
 Finalize Aggregate  (cost=100415.29..100415.30 rows=1 width=8)
   ->  Append  (cost=50207.63..100415.29 rows=2 width=8)
         ->  Partial Aggregate  (cost=50207.63..50207.64 rows=1 width=8)
               ->  Foreign Scan on orders_0  (cost=101.00..50195.13 rows=5000 width=0)
         ->  Partial Aggregate  (cost=50207.63..50207.64 rows=1 width=8)
               ->  Foreign Scan on orders_1  (cost=101.00..50195.13 rows=5000 width=0)

We really calculate a partial aggregate for each partition, but to do so we still have to fetch all the data from the remote host. So for foreign partitions such plans are absolutely inefficient. Maybe it should be combined with some other patch? For example, with the agg_pushdown_v4.tgz patch from https://www.postgresql.org/message-id/14577.1509723225%40localhost ? But it does not apply on top of the partition-wise-agg-v6.tar.gz patch.

Also, postgres_fdw in 11dev is able to push down aggregates even without the agg_pushdown_v4.tgz patch. In 0009-Teach-postgres_fdw-to-push-aggregates-for-child-rela.patch there is the following check:

/* Partial aggregates are not supported. */
+ if (extra->isPartial)
+     return;

If we just comment out this check, the produced plan becomes:

shard=# explain select sum(product_id) from orders;
                          QUERY PLAN
---------------------------------------------------------------
 Finalize Aggregate  (cost=308.41..308.42 rows=1 width=8)
   ->  Append  (cost=144.18..308.41 rows=2 width=8)
         ->  Foreign Scan  (cost=144.18..154.20 rows=1 width=8)
               Relations: Aggregate on (public.orders_0 orders)
         ->  Foreign Scan  (cost=144.18..154.20 rows=1 width=8)
               Relations: Aggregate on (public.orders_1 orders)
(6 rows)

And this is actually the desired plan!

Obviously such an approach will not always work: the FDW really doesn't support partial aggregates now. But for the most frequently used aggregates (sum, min, max, count) aggtype == aggtranstype, and there is no difference between partial and normal aggregate calculation. So instead of the (extra->isPartial) condition we could add a more complex check which traverses the pathtarget expressions and checks whether they can be evaluated in this way. Or... extend the FDW API to support partial aggregation.

But even the last plan is not ideal: it will run the per-node aggregation sequentially. There is the parallel append patch: https://www.postgresql.org/message-id/CAJ3gD9ctEcrVUmpY6fq_JUB6WDKGXAGd70EY68jVFA4kxMbKeQ%40mail.gmail.com but... the FDW doesn't support parallel scan, so parallel append cannot be applied in this case. And we actually do not need parallel append with all its dynamic workers here: we just need to start the commands at all remote servers and only afterwards fetch the results (which can be done sequentially).

I am investigating the problem of efficient execution of OLAP queries on sharded tables (tables with remote partitions). After reading all these threads and the corresponding patches, it seems to me that we already have most of the pieces of the puzzle; what we need is to put them in the right places and maybe add the missing ones. I wonder if somebody is busy with this, and can I somehow help here?

Also, I am not quite sure about the best approach to parallel execution of a distributed query at all nodes. Should we make postgres_fdw parallel-safe and use parallel append? How difficult would that be? Or, in addition to parallel append, should we also have an "asynchronous append" which would be able to initiate execution at all nodes? It seems to be close to merge append, because it should traverse all cursors simultaneously. The second approach looks easier to implement, but in the case of a sharded table a distributed query may need to traverse both remote and local shards, and this approach doesn't allow several local shards to be processed in parallel.

-- Konstantin Knizhnik Postgres Professional: http://www.postgrespro.com The Russian Postgres Company
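To make the "start commands at all remote servers and only then fetch results" idea concrete, here is a hand-written emulation of the desired execution pattern using dblink; connection strings and remote table names are illustrative, and this only sketches what an asynchronous append could do internally:

SELECT dblink_connect('shard1', 'dbname=postgres host=node1');
SELECT dblink_connect('shard2', 'dbname=postgres host=node2');

-- start the per-shard aggregates without waiting for their results
SELECT dblink_send_query('shard1', 'SELECT sum(product_id) FROM orders_0');
SELECT dblink_send_query('shard2', 'SELECT sum(product_id) FROM orders_1');

-- both shards are now working concurrently; fetching and combining
-- the partial results can be done sequentially
SELECT sum(s) AS total
FROM (
    SELECT * FROM dblink_get_result('shard1') AS r1(s bigint)
    UNION ALL
    SELECT * FROM dblink_get_result('shard2') AS r2(s bigint)
) partials;

SELECT dblink_disconnect('shard1');
SELECT dblink_disconnect('shard2');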
Re: [HACKERS] Partition-wise aggregation/grouping
On 11.11.2017 23:29, Konstantin Knizhnik wrote:
Also, I am not quite sure about the best approach to parallel execution of a distributed query at all nodes. Should we make postgres_fdw parallel-safe and use parallel append? How difficult would that be? Or, in addition to parallel append, should we also have an "asynchronous append" which would be able to initiate execution at all nodes?

I attach a small patch for postgres_fdw.c which allows concurrent execution of aggregates by all remote servers (when they are accessed through postgres_fdw). I have added "postgres_fdw.use_
[HACKERS] On conflict update & hint bits
Hi, I am facing a rarely reproducible problem in our multimaster (and never in vanilla Postgres). We are using our own customized transaction manager, so it may well be a problem in our multimaster. But the stack trace looks suspicious, and this is why I want to consult people familiar with this code about whether it is a bug in ExecOnConflictUpdate or not.

Briefly: ExecOnConflictUpdate tries to set a hint bit without holding a lock on the buffer and so gets an assertion failure in MarkBufferDirtyHint. Now the stack trace:

#0 0x7fe3b940acc9 in __GI_raise (sig=sig@entry=6) at ../nptl/sysdeps/unix/sysv/linux/raise.c:56
#1 0x7fe3b940e0d8 in __GI_abort () at abort.c:89
#2 0x0097b996 in ExceptionalCondition (conditionName=0xb4d970 "!(LWLockHeldByMe(((LWLock*) (&(bufHdr)->content_lock", errorType=0xb4d2e9 "FailedAssertion", fileName=0xb4d2e0 "bufmgr.c", lineNumber=3380) at assert.c:54
#3 0x007e365b in MarkBufferDirtyHint (buffer=946, buffer_std=1 '\001') at bufmgr.c:3380
#4 0x009c3660 in SetHintBits (tuple=0x7fe396a9d858, buffer=946, infomask=256, xid=1398) at tqual.c:136
#5 0x009c5194 in HeapTupleSatisfiesMVCC (htup=0x7ffc00169030, snapshot=0x2e79778, buffer=946) at tqual.c:1065
#6 0x006ace83 in ExecCheckHeapTupleVisible (estate=0x2e81ae8, tuple=0x7ffc00169030, buffer=946) at nodeModifyTable.c:197
#7 0x006ae343 in ExecOnConflictUpdate (mtstate=0x2e81d50, resultRelInfo=0x2e81c38, conflictTid=0x7ffc001690c0, planSlot=0x2e82428, excludedSlot=0x2e82428, estate=0x2e81ae8, canSetTag=1 '\001', returning=0x7ffc001690c8) at nodeModifyTable.c:1173
#8 0x006ad256 in ExecInsert (mtstate=0x2e81d50, slot=0x2e82428, planSlot=0x2e82428, arbiterIndexes=0x2e7eeb0, onconflict=ONCONFLICT_UPDATE, estate=0x2e81ae8, canSetTag=1 '\001') at nodeModifyTable.c:395
#9 0x006aebe3 in ExecModifyTable (node=0x2e81d50) at nodeModifyTable.c:1496

In ExecOnConflictUpdate the buffer is pinned but not locked:

/*
 * Lock tuple for update. Don't follow updates when tuple cannot be
 * locked without doing so. A row locking conflict here means our
 * previous conclusion that the tuple is conclusively committed is not
 * true anymore.
 */
tuple.t_self = *conflictTid;
test = heap_lock_tuple(relation, &tuple, estate->es_output_cid,
                       lockmode, LockWaitBlock, false, &buffer,
                       &hufd);

heap_lock_tuple pins the buffer but does not lock it:

 *	*buffer: set to buffer holding tuple (pinned but not locked at exit)

Later we try to check tuple visibility:

ExecCheckHeapTupleVisible(estate, &tuple, buffer);

and inside HeapTupleSatisfiesMVCC we try to set a hint bit. MarkBufferDirtyHint assumes that the buffer is locked:

 * 2. The caller might have only share-lock instead of exclusive-lock on the
 *    buffer's content lock.

and we get an assertion failure at:

/* here, either share or exclusive lock is OK */
Assert(LWLockHeldByMe(BufferDescriptorGetContentLock(bufHdr)));

So the question is whether it is correct that ExecOnConflictUpdate tries to access and update the tuple without holding a lock on the buffer?

Thanks in advance,

-- Konstantin Knizhnik Postgres Professional: http://www.postgrespro.com The Russian Postgres Company
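For context, a minimal sketch of the statement shape that reaches ExecOnConflictUpdate and then ExecCheckHeapTupleVisible is shown below; table and column names are made up, and on a vanilla assert-enabled build this by itself does not trigger the failure, it only exercises the code path under discussion when another session updates the same key concurrently:

CREATE TABLE accounts(id integer PRIMARY KEY, balance integer);

BEGIN ISOLATION LEVEL REPEATABLE READ;
-- a concurrent update of the same key is what makes heap_lock_tuple
-- report a conflicting tuple whose visibility is then re-checked
INSERT INTO accounts VALUES (1, 100)
    ON CONFLICT (id) DO UPDATE SET balance = accounts.balance + 100;
COMMIT;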
Re: [HACKERS] On conflict update & hint bits
On 30.09.2016 19:37, Peter Geoghegan wrote:
On Fri, Sep 30, 2016 at 5:33 PM, Konstantin Knizhnik wrote: Later we try to check tuple visibility: ExecCheckHeapTupleVisible(estate, &tuple, buffer); and inside HeapTupleSatisfiesMVCC try to set hint bit.
So, you're using repeatable read or serializable isolation level?

Repeatable read.

-- Konstantin Knizhnik Postgres Professional: http://www.postgrespro.com The Russian Postgres Company
Re: [HACKERS] On conflict update & hint bits
On 24.10.2016 00:49, Peter Geoghegan wrote:
On Sun, Oct 23, 2016 at 2:46 PM, Tom Lane wrote: What's bothering me is (a) it's less than 24 hours to release wrap and (b) this patch changes SSI-relevant behavior and hasn't been approved by Kevin. I'm not familiar enough with that logic to commit a change in it on my own authority, especially with so little time for problems to be uncovered.
I should point out that I knew that the next set of point releases had been moved forward much later than you did. I had to work on this fix during the week, which was pretty far from ideal for me for my own reasons.

Just for information: I know that you are working on this issue, but since we need to proceed with our testing of multimaster, I have made the following obvious changes, and they fix the problem (at least this assertion failure does not happen any more):

src/backend/executor/nodeModifyTable.c
@@ -1087,6 +1087,13 @@ ExecOnConflictUpdate(ModifyTableState *mtstate,
 	test = heap_lock_tuple(relation, &tuple, estate->es_output_cid,
 						   lockmode, LockWaitBlock, false, &buffer,
 						   &hufd);
+
+	/*
+	 * We must hold share lock on the buffer content while examining tuple
+	 * visibility.  Afterwards, however, the tuples we have found to be
+	 * visible are guaranteed good as long as we hold the buffer pin.
+	 */
+	LockBuffer(buffer, BUFFER_LOCK_SHARE);
+
 	switch (test)
 	{
 		case HeapTupleMayBeUpdated:
@@ -1142,6 +1149,7 @@ ExecOnConflictUpdate(ModifyTableState *mtstate,
 		 * loop here, as the new version of the row might not conflict
 		 * anymore, or the conflicting tuple has actually been deleted.
 		 */
+		LockBuffer(buffer, BUFFER_LOCK_UNLOCK);
 		ReleaseBuffer(buffer);
 		return false;
@@ -1175,6 +1183,8 @@ ExecOnConflictUpdate(ModifyTableState *mtstate,
 	/* Store target's existing tuple in the state's dedicated slot */
 	ExecStoreTuple(&tuple, mtstate->mt_existing, buffer, false);
 
+	LockBuffer(buffer, BUFFER_LOCK_UNLOCK);
+
 	/*
 	 * Make tuple and any needed join variables available to ExecQual and
 	 * ExecProject. The EXCLUDED tuple is installed in ecxt_innertuple, while

-- Konstantin Knizhnik Postgres Professional: http://www.postgrespro.com The Russian Postgres Company
[HACKERS] Unlogged tables cleanup
Hi, hackers

I wonder if the following behavior can be considered a bug:

knizhnik@knizhnik:~/dtm-data$ psql postgres
psql (10devel)
Type "help" for help.
postgres=# create tablespace fs location '/home/knizhnik/dtm-data/fs';
CREATE TABLESPACE
postgres=# set default_tablespace=fs;
SET
postgres=# create unlogged table foo(x integer);
CREATE TABLE
postgres=# insert into foo values(generate_series(1,10));
INSERT 0 10

Now simulate a server crash using "pkill -9 postgres".

knizhnik@knizhnik:~/dtm-data$ rm -f logfile ; pg_ctl -D pgsql.master -l logfile start
pg_ctl: another server might be running; trying to start server anyway
server starting
knizhnik@knizhnik:~/dtm-data$ psql postgres
psql (10devel)
Type "help" for help.
postgres=# select * from foo;
ERROR:  could not open file "pg_tblspc/16384/PG_10_201611041/12289/16385": No such file or directory

knizhnik@knizhnik:~/dtm-data$ ls fs
PG_10_201611041
knizhnik@knizhnik:~/dtm-data$ ls fs/PG_10_201611041/

So the whole relation directory has been removed! It happens only for the first table created in the tablespace. If you create the table in the Postgres data directory everything is ok: the first segment of the relation is truncated but not deleted. Also, if you create one more unlogged table in the tablespace, it is truncated correctly:

postgres=# set default_tablespace=fs;
SET
postgres=# create unlogged table foo1(x integer);
CREATE TABLE
postgres=# insert into foo1 values(generate_series(1,10));
INSERT 0 10
postgres=# \q
knizhnik@knizhnik:~/dtm-data$ pkill -9 postgres
knizhnik@knizhnik:~/dtm-data$ rm -f logfile ; pg_ctl -D pgsql.master -l logfile start
pg_ctl: another server might be running; trying to start server anyway
server starting
knizhnik@knizhnik:~/dtm-data$ psql postgres
psql (10devel)
Type "help" for help.
postgres=# select * from foo1;
 x
---
(0 rows)

knizhnik@knizhnik:~/dtm-data$ ls -l fs/PG_10_201611041/12289/*
-rw------- 1 knizhnik knizhnik 0 Nov  9 19:52 fs/PG_10_201611041/12289/32768
-rw------- 1 knizhnik knizhnik 0 Nov  9 19:52 fs/PG_10_201611041/12289/32768_init

-- Konstantin Knizhnik Postgres Professional: http://www.postgrespro.com The Russian Postgres Company
Re: [HACKERS] Unlogged tables cleanup
On Nov 10, 2016, at 10:17 AM, Michael Paquier wrote:
> Hm.. I cannot reproduce what you see on Linux or macos. Perhaps you
> have locally a standby pointing as well to this tablespace?

No, it is the latest sources from the Postgres repository. Please note that you should create a new database and tablespace to reproduce this issue. So the whole sequence is:

mkdir fs
initdb -D pgsql
pg_ctl -D pgsql -l logfile start
psql postgres
# create tablespace fs location '/home/knizhnik/dtm-data/fs';
# set default_tablespace=fs;
# create unlogged table foo(x integer);
# insert into foo values(generate_series(1,10));
# ^D
pkill -9 postgres
pg_ctl -D pgsql -l logfile start
# select * from foo;

> --
> Michael
[HACKERS] Two questions about Postgres parser
Hi hackers,

Working on a vectorized extension for Postgres (VOPS) I ran into two things in the Postgres parser which break my expectations and force me to abandon my original implementation plan. I wonder if it is really intended and correct that:

1. A moving-aggregate implementation should return the same type as the plain implementation. Yes, in most cases it is hard to find arguments why they should return different types. But it is not true for vectorized operations...

2. Implicit user-defined type casts are not applied for the COALESCE operator:

create type complex as (x float8, y float8);

create function float2complex(x float8) returns complex as $$
declare
    c complex;
begin
    c.x := x;
    c.y := 0;
    return c;
end;
$$ language plpgsql strict immutable;

create cast (float8 as complex) with function float2complex(float8) as implicit;

create table foo(c complex);

select coalesce(c, 0.0) from foo;
ERROR:  COALESCE types complex and numeric cannot be matched
LINE 1: select coalesce(c, 0.0) from foo;

select coalesce(c, 0.0::float8) from foo;
ERROR:  COALESCE types complex and double precision cannot be matched
LINE 1: select coalesce(c, 0.0::float8) from foo;

select coalesce(c, 0.0::float8::complex) from foo;
 coalesce
----------
(0 rows)

-- Konstantin Knizhnik Postgres Professional: http://www.postgrespro.com The Russian Postgres Company
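For reference, the restriction in point 1 shows up in the moving-aggregate variant of CREATE AGGREGATE: the moving implementation may have its own transition type, but the final result type stays that of the plain implementation. A minimal sketch using the built-in float8pl/float8mi functions (the aggregate name is made up):

CREATE AGGREGATE my_sum(float8) (
    sfunc    = float8pl,   -- plain transition function
    stype    = float8,
    msfunc   = float8pl,   -- moving-aggregate (window) transition function
    minvfunc = float8mi,   -- inverse transition function for frame removal
    mstype   = float8      -- may differ from stype, but then an mfinalfunc
                           -- must map it back to the aggregate's result type
);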
Re: [HACKERS] WIP: Faster Expression Processing v4
On 13.03.2017 11:03, Andres Freund wrote:
Hi, On 2017-03-12 05:40:51 +0100, Tomas Vondra wrote: I wanted to do a bit of testing and benchmarking on this, but 0004 seems to be a bit broken. Well, "broken" in the sense that it's already outdated, because other stuff that got merged. The patch does not apply anymore - there are some conflicts in execQual.c, but I think I fixed those. But then I ran into a bunch of compile-time errors, because some of the executor nodes still reference bits that were moved elsewhere. Updated patch attached. Note that this patch has two changes I've not yet evaluated performance-wise.

I got the following results on my system with an Intel(R) Core(TM) i7-4770 CPU @ 3.40GHz and 16Gb RAM, TPC-H Q1/Q6 at scale 10, shared_buffers=8Gb, pg_prewarm on the lineitem table projection:

                  Q1        Q6
 Master           7503 ms   1171 ms
 Your patch       6420 ms   1034 ms
 VOPS              396 ms    249 ms
 VOPS + patch      367 ms    233 ms

-- Konstantin Knizhnik Postgres Professional: http://www.postgrespro.com The Russian Postgres Company
[HACKERS] Parallel query execution with SPI
Hi hackers,

Is it possible to execute a query in parallel using SPI? If so, how can it be enforced? I tried to open a cursor with the CURSOR_OPT_PARALLEL_OK flag but it doesn't help: the query is executed by a single backend, while the same query launched at top level uses a parallel plan:

fsstate->portal = SPI_cursor_open_with_args(NULL, fsstate->query,
                                            fsstate->numParams, argtypes, values, nulls,
                                            true, CURSOR_OPT_PARALLEL_OK);
...
SPI_cursor_fetch(fsstate->portal, true, 1);

Thanks in advance,

-- Konstantin Knizhnik Postgres Professional: http://www.postgrespro.com The Russian Postgres Company
Re: [HACKERS] Parallel query execution with SPI
On 31.03.2017 13:48, Robert Haas wrote:
On Fri, Mar 31, 2017 at 3:33 AM, Konstantin Knizhnik wrote: Is it possible to execute a query in parallel using SPI? If so, how can it be enforced? I tried to open a cursor with the CURSOR_OPT_PARALLEL_OK flag but it doesn't help: the query is executed by a single backend while the same query launched at top level uses a parallel plan.
Parallel execution isn't possible if you are using a cursor-type interface, because a parallel query can't be suspended and resumed like a non-parallel query. If you use a function that executes the query to completion in one go, like SPI_execute_plan, then it's cool. See also commit 61c2e1a95f94bb904953a6281ce17a18ac38ee6d.

Thank you very much for the explanation. With SPI_execute the query is indeed executed in parallel. But it means that when I execute some query using SPI, I need to somehow predict the number of returned tuples. If it is not too large, then it is better to use SPI_execute to allow parallel execution of the query. But if it is large, then SPI_execute without a limit can cause memory overflow. Certainly I can specify some reasonable limit and, if it is reached, fall back to a cursor. But that is neither convenient nor efficient. I wonder if somebody can suggest a better solution?

-- Konstantin Knizhnik Postgres Professional: http://www.postgrespro.com The Russian Postgres Company
[HACKERS] FDW and parallel execution
   ->  Hash Join  (cost=20.05..40.79 rows=15 width=36)
         Hash Cond: (nation.n_regionkey = region.r_regionkey)
         ->  Seq Scan on nation  (cost=0.00..17.70 rows=770 width=40)
         ->  Hash  (cost=20.00..20.00 rows=4 width=4)
               ->  Seq Scan on region  (cost=0.00..20.00 rows=4 width=4)
                     Filter: ((r_name)::text = 'ASIA'::text)
   ->  Hash  (cost=294718.76..294718.76 rows=2284376 width=8)
         ->  Foreign Scan on orders_fdw  (cost=0.00..294718.76 rows=2284376 width=8)
   ->  Hash  (cost=32605.64..32605.64 rows=1500032 width=8)
         ->  Foreign Scan on customer_fdw  (cost=0.00..32605.64 rows=1500032 width=8)

The plans look very similar, but the first one is parallel and the second is not. My FDW provides an implementation of IsForeignScanParallelSafe which returns true. I wonder what can prevent the optimizer from using a parallel plan in this case?

Thanks in advance,

-- Konstantin Knizhnik Postgres Professional: http://www.postgrespro.com The Russian Postgres Company
Re: [HACKERS] FDW and parallel execution
On 04.04.2017 13:29, Kyotaro HORIGUCHI wrote:
Hi, At Sun, 02 Apr 2017 16:30:24 +0300, Konstantin Knizhnik wrote in <58e0fcf0.2070...@postgrespro.ru>: Hi hackers and personally Robert (you are the best expert in both areas). I want to ask one more question concerning parallel execution and FDW. Below are two plans for the same query (TPC-H Q5): one for normal tables, another for an FDW to a vertical representation of the same data. The FDW supports the analyze function and is expected to produce similar statistics as for the original tables. The plans look very similar, but the first one is parallel and the second is not. My FDW provides an implementation of IsForeignScanParallelSafe which returns true. I wonder what can prevent the optimizer from using a parallel plan in this case?
Parallel execution requires partial paths. It's the work for GetForeignPaths of your FDW.

Thank you very much for the explanation. But unfortunately I still do not completely understand what kind of queries allow parallel execution with an FDW. The section "FDW Routines for Parallel Execution" of the FDW documentation says:

A ForeignScan node can, optionally, support parallel execution. A parallel ForeignScan will be executed in multiple processes and should return each row only once across all cooperating processes. To do this, processes can coordinate through fixed size chunks of dynamic shared memory. This shared memory is not guaranteed to be mapped at the same address in every process, so pointers may not be used. The following callbacks are all optional in general, but required if parallel execution is to be supported.

I provided IsForeignScanParallelSafe, EstimateDSMForeignScan, InitializeDSMForeignScan and InitializeWorkerForeignScan in my FDW. IsForeignScanParallelSafe returns true. Also, in the GetForeignPaths function I created a path with baserel->consider_parallel == true. Is that enough, or should I do something else? But unfortunately I failed to find any query (sequential scan, grand aggregate, aggregation with group by, joins...) for which a parallel execution plan is used with this FDW. Also there are no examples of using these functions in the Postgres distribution, and I failed to find any such examples on the Internet.

Can somebody please clarify the situation with parallel execution and FDWs, and maybe point me at some examples? Thanks in advance.

-- Konstantin Knizhnik Postgres Professional: http://www.postgrespro.com The Russian Postgres Company
Re: [HACKERS] ASOF join
On 16.06.2017 19:07, David Fetter wrote:
On Fri, Jun 16, 2017 at 11:51:34AM +1200, Thomas Munro wrote:
On Fri, Jun 16, 2017 at 4:20 AM, Konstantin Knizhnik wrote: I wonder if there were some discussions/attempts to add ASOF join to Postgres (sorry, maybe there is a better term for it; I am referring to the KDB definition: http://code.kx.com/wiki/Reference/aj ).
Interesting idea. Also in Pandas: http://pandas.pydata.org/pandas-docs/version/0.19.0/generated/pandas.merge_asof.html#pandas.merge_asof
I attached a simple patch adding ASOF join to Postgres. Right now it supports only outer join and requires a USING clause (consequently it is not possible to join two tables whose join keys have different names). Maybe it is also possible to support an ON clause with a condition written like o.k1 = i.k1 AND o.k2 = i.k2 AND ... AND o.kN >= i.kN, but such notation can be confusing, because the join result includes only one matching inner record with kN smaller than or equal to the kN of the outer record, and not all such records. As an alternative we can add special syntax for it. If people find such a construction really useful, I will continue work on it.
I suppose you could write a function that pulls tuples out of a bunch of cursors and zips them together like this, as a kind of hand-coded special merge join "except that we match on nearest key rather than equal keys" (as they put it). I've written code like this before in a trading context, where we called that 'previous tick interpolation', and in a scientific context where other kinds of interpolation were called for (so not really matching a tuple but synthesising one if no exact match). If you view the former case as a kind of degenerate case of interpolation then it doesn't feel like a "join" as we know it, but clearly it is. I had never considered before that such things might belong inside the database as a kind of join operator.
If you turn your head sideways, it's very similar to the range merge join Jeff Davis proposed. https://commitfest.postgresql.org/14/1106/

Maybe, but I do not understand how to limit the result to contain exactly one (the last) inner tuple for each outer tuple.

-- Konstantin Knizhnik Postgres Professional: http://www.postgrespro.com The Russian Postgres Company

diff --git a/contrib/postgres_fdw/deparse.c b/contrib/postgres_fdw/deparse.c
index 482a3dd..f7a8f38 100644
--- a/contrib/postgres_fdw/deparse.c
+++ b/contrib/postgres_fdw/deparse.c
@@ -1324,6 +1324,9 @@ get_jointype_name(JoinType jointype)
 		case JOIN_FULL:
 			return "FULL";
 
+		case JOIN_ASOF:
+			return "ASOF";
+
 		default:
 			/* Shouldn't come here, but protect from buggy code. */
 			elog(ERROR, "unsupported join type %d", jointype);
diff --git a/contrib/postgres_fdw/postgres_fdw.c b/contrib/postgres_fdw/postgres_fdw.c
index 080cb0a..54cf6c1 100644
--- a/contrib/postgres_fdw/postgres_fdw.c
+++ b/contrib/postgres_fdw/postgres_fdw.c
@@ -4073,7 +4073,7 @@ foreign_join_ok(PlannerInfo *root, RelOptInfo *joinrel, JoinType jointype,
 	 * Constructing queries representing SEMI and ANTI joins is hard, hence
 	 * not considered right now.
 	 */
-	if (jointype != JOIN_INNER && jointype != JOIN_LEFT &&
+	if (jointype != JOIN_INNER && jointype != JOIN_LEFT && jointype != JOIN_ASOF &&
 		jointype != JOIN_RIGHT && jointype != JOIN_FULL)
 		return false;
@@ -4211,6 +4211,7 @@ foreign_join_ok(PlannerInfo *root, RelOptInfo *joinrel, JoinType jointype,
 			break;
 
 		case JOIN_LEFT:
+		case JOIN_ASOF:
 			fpinfo->joinclauses = list_concat(fpinfo->joinclauses,
 											  list_copy(fpinfo_i->remote_conds));
 			fpinfo->remote_conds = list_concat(fpinfo->remote_conds,
diff --git a/doc/src/sgml/ref/select.sgml b/doc/src/sgml/ref/select.sgml
index 211e4c3..fd3be8c 100644
--- a/doc/src/sgml/ref/select.sgml
+++ b/doc/src/sgml/ref/select.sgml
@@ -514,6 +514,9 @@ TABLE [ ONLY ] table_name [ * ]
     CROSS JOIN
+
+    ASOF [ OUTER ] JOIN
+
 For the INNER and OUTER join types, a
@@ -523,7 +526,9 @@ TABLE [ ONLY ] table_name [ * ]
 USING (join_column [, ...]). See below for the meaning. For CROSS JOIN,
-none of these clauses can appear.
+none of these clauses can appear. For ASOF join type, a
+join condition must be USING (join_column [, ...]).
@@ -571,6 +576,32 @@ TABLE [ ONLY ] table_name [ * ]
 on the right), plus one row for each unmatched right-hand row (extended with nulls on the left).
+
+ ASOF OUTER JOIN is similar to LEFT OUTER JOIN but it accepts only
+USING (join_column_1 [, ...], join_column_N) clause
+where last joined column join_column_N is expected to be timestamp
+(but actually can have any comparable type) and outer tuple is matched with only one inner tuple with the s
Re: [HACKERS] ASOF join
On 21.06.2017 11:00, Thomas Munro wrote:
Hmm. Yeah, I see the notational problem. It's hard to come up with a new syntax that has SQL nature. What if... we didn't use a new syntax at all, but recognised existing queries that are executable with this strategy? Queries like this:

WITH ticks(time, price) AS
       (VALUES ('2017-07-20 12:00:00'::timestamptz, 100.00),
               ('2017-07-21 11:00:00'::timestamptz, 150.00)),
     times(time) AS
       (VALUES ('2017-07-19 12:00:00'::timestamptz),
               ('2017-07-20 12:00:00'::timestamptz),
               ('2017-07-21 12:00:00'::timestamptz),
               ('2017-07-22 12:00:00'::timestamptz))
SELECT times.time, previous_tick.price
FROM times
LEFT JOIN LATERAL (SELECT * FROM ticks
                   WHERE ticks.time <= times.time
                   ORDER BY ticks.time DESC LIMIT 1) previous_tick ON true
ORDER BY times.time;

          time          | price
------------------------+--------
 2017-07-19 12:00:00+12 |
 2017-07-20 12:00:00+12 | 100.00
 2017-07-21 12:00:00+12 | 150.00
 2017-07-22 12:00:00+12 | 150.00
(4 rows)

I haven't used LATERAL much myself but I've noticed that it's often used to express this type of thing. "Get me the latest ... as of time ...". It'd be a bit like the way we recognise EXISTS (...) as a semi-join and execute it with a join operator instead of having a SEMI JOIN syntax. On the other hand it's a bit more long winded, extreme and probably quite niche.

Thank you for this idea. I agree that this is the best way of implementing ASOF join: just as an optimization of a standard SQL query. But do you think it would still be a good idea to extend the SQL syntax with an ASOF JOIN ... USING ... clause? It would significantly simplify writing queries like the one above and IMHO doesn't introduce any confusion with standard SQL syntax. My primary motivation for suggesting ASOF join for Postgres was not just building a more efficient plan (using a merge join instead of a nested loop) but also simplifying the writing of such queries. Or do you think that nobody will be interested in non-standard SQL extensions?

-- Konstantin Knizhnik Postgres Professional: http://www.postgrespro.com The Russian Postgres Company
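For illustration, under the proposed non-standard syntax from the attached patch the LATERAL query above could presumably be written roughly as follows (a sketch only, reusing the ticks/times relations from the example; the exact semantics are whatever the patch implements):

-- for each row of "times", return the single "ticks" row with the
-- greatest ticks.time <= times.time
SELECT time, price
FROM times ASOF JOIN ticks USING (time)
ORDER BY time;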
Re: [HACKERS] Error : undefined symbol : LWLockAssign in 9.6.3
On 08/09/2017 07:07 AM, 송기훈 wrote:
Hi. I'm trying to use the imcs module with 9.6 and got this error message: the LWLockAssign function has been deleted from 9.6, so I can't use this module anymore starting with 9.6. What I want to ask is whether your team has decided not to support the imcs module anymore, or does not care about the imcs module, or whether there are any other ways to run PostgreSQL in memory only?

Hi, I am the author of the IMCS module and I maintain it. Please contact me directly. I have committed a patch to the https://github.com/knizhnik/imcs.git repository which allows IMCS to be used with 9.6.3 and later Postgres versions.

-- Konstantin Knizhnik Postgres Professional: http://www.postgrespro.com The Russian Postgres Company
[HACKERS] Secondary index access optimizations
Hi hackers,

I am trying to compare different ways of optimizing work with huge append-only tables in PostgreSQL, where the primary key is something like a timestamp and queries usually access the most recent data using some secondary keys. The size of the secondary index is one of the most critical factors limiting insert/search performance. As far as data is inserted in timestamp-ascending order, access to the primary key is well localized and the accessed part of the table is present in memory. But if we create a secondary key for the whole table, then access to it will require random reads from the disk and significantly decrease performance.

There are two well known solutions to the problem:
1. Table partitioning
2. Partial indexes

These are the approaches I want to compare. First of all I want to check whether the optimizer is able to generate an efficient query execution plan covering different time intervals. Unfortunately, in both cases the generated plan is not optimal.

1. Table partitioning:

create table base (k integer primary key, v integer);
create table part1 (check (k between 1 and 10000)) inherits (base);
create table part2 (check (k between 10001 and 20000)) inherits (base);
create index pi1 on part1(v);
create index pi2 on part2(v);
insert into part1 values (generate_series(1,10000), random());
insert into part2 values (generate_series(10001,20000), random());
explain select * from base where k between 1 and 20000 and v = 100;
                              QUERY PLAN
-----------------------------------------------------------------------
 Append  (cost=0.00..15.65 rows=3 width=8)
   ->  Seq Scan on base  (cost=0.00..0.00 rows=1 width=8)
         Filter: ((k >= 1) AND (k <= 20000) AND (v = 100))
   ->  Index Scan using pi1 on part1  (cost=0.29..8.31 rows=1 width=8)
         Index Cond: (v = 100)
         Filter: ((k >= 1) AND (k <= 20000))
   ->  Index Scan using pi2 on part2  (cost=0.29..7.34 rows=1 width=8)
         Index Cond: (v = 100)
         Filter: ((k >= 1) AND (k <= 20000))

Questions:
- Is there some way to avoid the sequential scan of the parent table? Yes, it is empty and so the sequential scan will not take much time, but... it still requires some additional actions and so increases query execution time.
- Why does the index scan of the partition indexes include the filter condition, if the check constraint guarantees that all records of this partition match the search predicate?

2. Partial indexes:

create table t (k integer primary key, v integer);
insert into t values (generate_series(1,20000), random());
create index i1 on t(v) where k between 1 and 10000;
create index i2 on t(v) where k between 10001 and 20000;

postgres=# explain select * from t where k between 1 and 10000 and v = 100;
                         QUERY PLAN
-------------------------------------------------------------
 Index Scan using i1 on t  (cost=0.29..7.28 rows=1 width=8)
   Index Cond: (v = 100)
(2 rows)

Here we get the perfect plan. Let's try to extend the search interval:

postgres=# explain select * from t where k between 1 and 20000 and v = 100;
                            QUERY PLAN
-------------------------------------------------------------------
 Index Scan using t_pkey on t  (cost=0.29..760.43 rows=1 width=8)
   Index Cond: ((k >= 1) AND (k <= 20000))
   Filter: (v = 100)
(3 rows)

Unfortunately in this case Postgres is not able to apply the partial indexes. And this is what I expected to get:

postgres=# explain select * from t where k between 1 and 10000 and v = 100
           union all
           select * from t where k between 10001 and 20000 and v = 100;
                             QUERY PLAN
---------------------------------------------------------------------
 Append  (cost=0.29..14.58 rows=2 width=8)
   ->  Index Scan using i1 on t  (cost=0.29..7.28 rows=1 width=8)
         Index Cond: (v = 100)
   ->  Index Scan using i2 on t t_1  (cost=0.29..7.28 rows=1 width=8)
         Index Cond: (v = 100)

I wonder if there are some fundamental problems with supporting these two things in the optimizer:
1. Remove a search condition on the primary key if it is fully satisfied by the derived table's check constraint.
2. Append index scans of several partial indexes if the specified interval is covered by their conditions.

I wonder if someone who is familiar with this part of the optimizer can easily fix it. Otherwise I am going to spend some time on solving these problems (if the community thinks that such optimizations would be useful).

-- Konstantin Knizhnik Postgres Professional: http://www.postgrespro.com The Russian Postgres Company
Re: [HACKERS] Secondary index access optimizations
On 14.08.2017 12:37, Konstantin Knizhnik wrote:
I wonder if there are some fundamental problems with supporting these two things in the optimizer:
1. Remove a search condition on the primary key if it is fully satisfied by the derived table's check constraint.
2. Append index scans of several partial indexes if the specified interval is covered by their conditions.

Replying to myself: the following small patch removes redundant checks from index scans for derived tables:

diff --git a/src/backend/optimizer/util/plancat.c b/src/backend/optimizer/util/plancat.c
index 939045d..1f7c9cf 100644
--- a/src/backend/optimizer/util/plancat.c
+++ b/src/backend/optimizer/util/plancat.c
@@ -1441,6 +1441,20 @@ relation_excluded_by_constraints(PlannerInfo *root,
 	if (predicate_refuted_by(safe_constraints, rel->baseres
Re: [HACKERS] Secondary index access optimizations
On 14.08.2017 19:33, Konstantin Knizhnik wrote:
Replying to myself: the following small patch removes redundant checks from index scans for derived tables:

diff --git a/src/backend/optimizer/util/plancat.c b/src/backend/optimizer/util/plancat.c
index 939045d..1f7c9cf 100644
--- a/src/backend/optimizer/util/plancat.c
+++ b/src/backend/optimizer/util/plancat.c
@@ -1441,6 +1441,20 @@ relation_excluded_by_constraints(PlannerInfo *root,
 	if (predicate_refuted
Re: [HACKERS] [PATCH] Push limit to sort through a subquery
On 29.04.2017 00:13, Douglas Doole wrote:
If you add this to the commitfest app, more people might look at it when the next commitfest opens. I have added it. https://commitfest.postgresql.org/14/1119/ Also, it might help if you can provide a query/ies with numbers where this optimization shows improvement. I can't provide the real queries where we encountered the problem because they are internal. However I showed a simplified version of the queries in my first post. On our queries, the change made quite a difference - execution time dropped from 31.4 seconds to 7.2 seconds. Explain analyze also shows that memory use dropped significantly and we didn't have to spill the sort to disk.

From:
->  Sort  (cost=989.95..1013.27 rows=9326 width=30) (node_startup_time/loop=31328.891, node_total_time/loop: 31329.756 rows=2001 loops=1)
      Buffers: temp read=772 written=11201 lsm_bufmgr hits=3392
      Sort Key: ***
      Sort Method: external merge  Sort Space Used: 89592  Sort Space Type: Disk

To:
->  Sort  (cost=989.95..1013.27 rows=9326 width=30) (node_startup_time/loop=7123.275, node_total_time/loop: 7123.504 rows=2001 loops=1)
      Buffers: lsm_bufmgr hits=3387
      Sort Key: ***
      Sort Method: top-N heapsort  Sort Space Used: 3256  Sort Space Type: Memory

Attached please find yet another small patch which pushes down LIMIT to ForeignScan. I should note that currently the Postgres optimizer uses "Merge Append" and fetches only the required number of tuples from the remote nodes, so even without LIMIT push-down postgres_fdw will not pull the whole table from the remote host. postgres_fdw uses a cursor for fetching data from the remote server. The default fetch size is 100, so even without a limit the remote query will fetch no more than 100 rows at the remote site.

Assume the following example:

postgres=# create extension postgres_fdw;
postgres=# create server shard1 FOREIGN DATA WRAPPER postgres_fdw options(dbname 'postgres', host 'localhost', port '5432');
postgres=# create server shard2 FOREIGN DATA WRAPPER postgres_fdw options(dbname 'postgres', host 'localhost', port '5432');
postgres=# CREATE USER MAPPING for $user SERVER shard1 options (user '$user');
postgres=# CREATE USER MAPPING for $user SERVER shard2 options (user '$user');
postgres=# CREATE TABLE t(u integer primary key, v integer);
postgres=# CREATE TABLE t1(u integer primary key, v integer);
postgres=# CREATE TABLE t2(u integer primary key, v integer);
postgres=# insert into t1 values (generate_series(1,10), random()*10);
postgres=# insert into t2 values (generate_series(1,10), random()*10);
postgres=# CREATE FOREIGN TABLE t_fdw1() inherits (t) server shard1 options(table_name 't1');
postgres=# CREATE FOREIGN TABLE t_fdw2() inherits (t) server shard2 options(table_name 't2');

postgres=# explain analyze select * from t order by u limit 1;
                                        QUERY PLAN
-------------------------------------------------------------------------------------------
 Limit  (cost=200.15..200.20 rows=1 width=8) (actual time=2.010..2.010 rows=1 loops=1)
   ->  Merge Append  (cost=200.15..449.39 rows=5121 width=8) (actual time=2.009..2.009 rows=1 loops=1)
         Sort Key: t.u
         ->  Index Scan using t_pkey on t  (cost=0.12..8.14 rows=1 width=8) (actual time=0.005..0.005 rows=0 loops=1)
         ->  Foreign Scan on t_fdw2  (cost=100.00..193.92 rows=2560 width=8) (actual time=1.074..1.074 rows=1 loops=1)
         ->  Foreign Scan on t_fdw1  (cost=100.00..193.92 rows=2560 width=8) (actual time=0.928..0.928 rows=1 loops=1)
 Planning time: 0.769 ms
 Execution time: 6.837 ms
(8 rows)

As you can see, the foreign scan fetches only one row from each remote node.
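As a side note, the 100-row cursor fetch size mentioned above is just the default and is configurable in postgres_fdw through the fetch_size option, for example (using the server and foreign table names from the example above):

ALTER SERVER shard1 OPTIONS (ADD fetch_size '1000');
-- or per foreign table:
ALTER FOREIGN TABLE t_fdw1 OPTIONS (ADD fetch_size '1000');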
But still, pushing down the limit can have a positive effect on performance, especially if the Sort can be replaced with a top-N sort. I got the following results (times in seconds):

 Query                                 original   limit push-down
 select * from t order by u limit 1    2.276      1.777
 select * from t order by v limit 1    100        42

There is an index on "u", so fetching the records with the smallest "u" values can be done without sorting, and the times are similar. But in the case of sorting by "v", pushing down the limit allows a top-1 sort to be used instead of a global sort, which reduces query execution time by more than a factor of two.

-- Konstantin Knizhnik Postgres Professional: http://www.postgrespro.com The Russian Postgres Company

diff --git a/contrib/postgres_fdw/postgres_fdw.c b/contrib/postgres_fdw/postgres_fdw.c
index 080cb0a..e3847ce 100644
--- a/contrib/postgres_fdw/postgres_fdw.c
+++ b/contrib/postgres_fdw/postgres_fdw.c
@@ -2949,7 +2949,8 @@ create_cursor(ForeignScanState *node)
 	initStringInfo(&buf);
 	appendStringInfo(&buf, "DECLARE c%u CURSOR FOR\n%s",
[HACKERS] Cached plans and statement generalization
Hi hackers,

There have been a lot of discussions about query plan caching on the hackers mailing list, but I failed to find a clear answer to my question and the current consensus of the Postgres community on it. As far as I understand, the current state is the following:
1. We have per-connection prepared statements.
2. Queries executed inside plpgsql code are implicitly prepared.

It is not always possible or convenient to use prepared statements, for example when pgbouncer is used to perform connection pooling. Another use case (which is actually the problem I am trying to solve now) is partitioning. Efficient execution of a query against a partitioned table requires a hardcoded value for the partitioning key; only in this case is the optimizer able to construct an efficient query plan which accesses only the affected tables (partitions). My small benchmark for a distributed partitioned table based on pg_pathman + postgres_fdw shows a threefold performance degradation when prepared statements are used. But without prepared statements a substantial amount of time is spent on query parsing and planning. I was able to speed the benchmark up more than two times by sending prepared queries directly to the remote nodes.

So what I am thinking about now is implicit query caching. If the same query with different literal values is repeated many times, we can try to generalize this query and replace it with a prepared query with parameters. I am not considering a shared query cache for now: it seems to be much harder to implement. But local caching of generalized queries seems not so difficult to implement and does not require many changes in the Postgres code. And it can be useful not only for sharding, but for many other cases where prepared statements cannot be used.

I wonder whether such an option was already considered and, if it was rejected for some reason, can you point me to those reasons?

-- Konstantin Knizhnik Postgres Professional: http://www.postgrespro.com The Russian Postgres Company
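To make the partitioning argument concrete, here is a minimal sketch (the table and partition names are invented for illustration) showing how a hardcoded literal lets 9.6-era constraint exclusion scan a single child of an inheritance-partitioned table, while a prepared statement loses that benefit once the plan cache switches to a generic plan:

CREATE TABLE measurement(id int, logdate date);
CREATE TABLE measurement_y2017m01
    (CHECK (logdate >= '2017-01-01' AND logdate < '2017-02-01')) INHERITS (measurement);
CREATE TABLE measurement_y2017m02
    (CHECK (logdate >= '2017-02-01' AND logdate < '2017-03-01')) INHERITS (measurement);
SET constraint_exclusion = partition;

-- A literal lets the planner prove that only one child can contain matching rows.
EXPLAIN SELECT * FROM measurement WHERE logdate = '2017-01-15';

-- With a parameter the generic plan must keep every child: the first few
-- executions may still use custom plans, but after the switch to a generic
-- plan all partitions are scanned.
PREPARE by_day(date) AS SELECT * FROM measurement WHERE logdate = $1;
EXPLAIN EXECUTE by_day('2017-01-15');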
Re: [HACKERS] Cached plans and statement generalization
On 24.04.2017 13:24, Alexander Korotkov wrote:
Hi, Konstantin!
On Mon, Apr 24, 2017 at 11:46 AM, Konstantin Knizhnik <k.knizh...@postgrespro.ru> wrote:
There have been a lot of discussions about query plan caching on the hackers mailing list, but I failed to find a clear answer to my question and the current consensus of the Postgres community on it. As far as I understand, the current state is the following:
1. We have per-connection prepared statements.
2. Queries executed inside plpgsql code are implicitly prepared.
It is not always possible or convenient to use prepared statements, for example when pgbouncer is used to perform connection pooling. Another use case (which is actually the problem I am trying to solve now) is partitioning. Efficient execution of a query against a partitioned table requires a hardcoded value for the partitioning key; only in this case is the optimizer able to construct an efficient query plan which accesses only the affected tables (partitions). My small benchmark for a distributed partitioned table based on pg_pathman + postgres_fdw shows a threefold performance degradation when prepared statements are used. But without prepared statements a substantial amount of time is spent on query parsing and planning. I was able to speed the benchmark up more than two times by sending prepared queries directly to the remote nodes.
I don't think it's correct to ask PostgreSQL hackers about a problem which arises with pg_pathman, while pg_pathman is an extension supported by Postgres Pro. Since we have declarative partitioning committed to 10, I think the community should address this issue in the context of declarative partitioning. However, it's unlikely we can spot this issue with declarative partitioning because it still uses the very inefficient constraint exclusion mechanism. Thus, the issues you are writing about would become visible with declarative partitioning only when constraint exclusion is replaced with something more efficient. Long story short, could you reproduce this issue without pg_pathman?

Sorry, I mentioned pg_pathman just as an example. The same problem occurs with partitioning based on the standard Postgres inheritance mechanism (when I manually create derived tables and specify constraints for them). I have not yet tested the declarative partitioning committed to 10, but I expect that it will also suffer from this problem (because it is based on inheritance). But as I wrote, I think that the problem with plan caching is wider and is not limited to partitioning.

-- Alexander Korotkov Postgres Professional: http://www.postgrespro.com The Russian Postgres Company
-- Konstantin Knizhnik Postgres Professional: http://www.postgrespro.com The Russian Postgres Company
Re: [HACKERS] Cached plans and statement generalization
On 24.04.2017 21:43, Andres Freund wrote:
Hi,
On 2017-04-24 11:46:02 +0300, Konstantin Knizhnik wrote:
So what I am thinking about now is implicit query caching. If the same query with different literal values is repeated many times, we can try to generalize this query and replace it with a prepared query with parameters.
That's actually not all that easy:
- You pretty much have to do parse analysis to be able to do an accurate match. How much overhead is parse analysis vs. planning in your cases?
- The invalidation infrastructure for this, if not tied to fully parse-analyzed statements, is going to be hell.
- Migrating to parameters can actually cause significant slowdowns, not nice if that happens implicitly.

Well, first of all I want to share the results I have already obtained: pgbench with default parameters, scale 10 and one connection:

protocol               TPS
simple                 3492
extended               2927
prepared               6865
simple + autoprepare   6844

So autoprepare is as efficient as explicit prepare and can increase performance almost two times.

My current implementation replaces only string literals in the query with parameters, i.e.

select * from T where x='123';  ->  select * from T where x=$1;

It greatly simplifies the matching of parameters: it is just necessary to locate the '\'' character and then correctly handle pairs of quotes. Handling integer and real literals is a really challenging task. One source of problems is negation: it is not so easy to decide correctly whether a minus should be treated as part of the literal or as an operator: (-1), (1-1), (1-1)-1. Another problem is caused by integer literals used in contexts where parameters cannot be used, for example "order by 1". Fully correct substitution can be done by first parsing the query, then transforming the parse tree, replacing literal nodes with parameter nodes, and finally deparsing the tree into a generalized query. postgres_fdw already contains such deparse code; it could be moved to the Postgres core and reused for autoprepare (and maybe elsewhere). But in this case the overhead will be much higher. I still think that query parsing time is significantly smaller than the time needed for building and optimizing the query execution plan, but it should be measured if the community is interested in such an approach.

There is an obvious question: how did I manage to get these pgbench results if currently only substitution of string literals is supported and the queries constructed by pgbench don't contain string literals? I just made a small patch to pgbench's replaceVariable method, wrapping the value's representation in quotes. It has almost no impact on performance (3482 TPS vs. 3492 TPS), but it allows autoprepare to deal with pgbench queries.

I attached my patch to this mail. It is just the first version of the patch (based on the REL9_6_STABLE branch), intended to illustrate the proposed approach. I will be glad to receive any comments, and if such an optimization is considered useful, I will continue work on this patch.
-- Konstantin Knizhnik Postgres Professional: http://www.postgrespro.com The Russian Postgres Company diff --git a/src/backend/tcop/postgres.c b/src/backend/tcop/postgres.c index f6be98b..6291d66 100644 --- a/src/backend/tcop/postgres.c +++ b/src/backend/tcop/postgres.c @@ -96,8 +96,6 @@ int max_stack_depth = 100; /* wait N seconds to allow attach from a debugger */ int PostAuthDelay = 0; - - /* * private variables * @@ -188,6 +186,7 @@ static bool IsTransactionStmtList(List *parseTrees); static void drop_unnamed_stmt(void); static void SigHupHandler(SIGNAL_ARGS); static void log_disconnections(int code, Datum arg); +static bool exec_cached_query(const char *query_string); /* @@ -916,6 +915,14 @@ exec_simple_query(const char *query_string) drop_unnamed_stmt(); /* + * Try to find cached plan + */ + if (autoprepare_threshold != 0 && exec_cached_query(query_string)) + { + return; + } + + /* * Switch to appropriate context for constructing parsetrees. */ oldcontext = MemoryContextSwitchTo(MessageContext); @@ -4500,3 +4509,566 @@ log_disconnections(int code, Datum arg) port->user_name, port->database_name, port->remote_host, port->remote_port[0] ? " port=" : "", port->remote_port))); } + + + +typedef struct { + char const* query; + int64 exec_count; + CachedPlanSource* plan; + int n_params; + int16 format; +} plan_cache_entry; + + +/* + * Replace string literals with parameters. We do not consider integer or real literals to avoid problems with + * negative number, user defined operators, ... For example it is not easy to distinguish cases (-1), (1-1), (1-1)-1 + */ +static void generalize_statement(const char *query_string, char** gen_query, char** query_params, int* n_params) +{ + size_t
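To spell out what the generalization in the message above amounts to, the following sketch shows the explicit prepared statement a client would otherwise have to create by hand (the table and column are hypothetical and just mirror the SELECT used as an example above):

CREATE TABLE T(x text, y int);
-- A client repeatedly sends queries that differ only in the string literal:
--   SELECT * FROM T WHERE x = '123';
--   SELECT * FROM T WHERE x = '456';
-- Autoprepare generalizes them into one cached statement, conceptually:
PREPARE t_by_x(text) AS SELECT * FROM T WHERE x = $1;
EXECUTE t_by_x('123');
EXECUTE t_by_x('456');
-- The difference is that the server performs this caching transparently,
-- so unmodified applications (e.g. behind pgbouncer) get the benefit.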
Re: [HACKERS] Cached plans and statement generalization
On 25.04.2017 19:12, Serge Rielau wrote:
On Apr 25, 2017, at 8:11 AM, Konstantin Knizhnik <k.knizh...@postgrespro.ru> wrote:
Another problem is caused by using integer literals in contexts where parameters cannot be used, for example "order by 1".
You will also need to deal with modifiers in types such as VARCHAR(10). Not sure if there are specific functions which can only deal with literals (?) as well.
Sorry, I do not completely understand how the presence of type modifiers can affect string literals used in a query. Can you provide me with an example?
Doug Doole did this work in DB2 LUW and he may be able to point to more places to watch out for semantically. Generally, in my experience, this feature is very valuable when dealing with (poorly designed) web apps that just glue together strings.
I do not think that this optimization will be useful only for poorly designed applications. I have already pointed out two use cases where prepared statements cannot be used:
1. pgbouncer without session-level pooling.
2. partitioning.
Protecting it under a GUC would allow the work to be done only if it's deemed likely to help. Another rule I find useful is to abort any effort to substitute literals if any bind variable is found in the query. That can be used as a cue that the author of the SQL left the remaining literals in on purpose. A follow-up feature would be to formalize different flavors of peeking, i.e. can you produce a generic plan, but still recruit the initial set of bind values/substituted literals to do costing?
Here the situation is the same as for explicitly prepared statements, isn't it? Sometimes it is preferable to use a specialized plan rather than a generic plan. I am not sure whether Postgres is currently able to do this.
Cheers Serge Rielau Salesforce.com
PS: FWIW, I like this feature.
-- Konstantin Knizhnik Postgres Professional: http://www.postgrespro.com The Russian Postgres Company
Re: [HACKERS] Cached plans and statement generalization
On 04/25/2017 07:54 PM, David Fetter wrote:
On Tue, Apr 25, 2017 at 06:11:09PM +0300, Konstantin Knizhnik wrote:
On 24.04.2017 21:43, Andres Freund wrote:
Hi,
On 2017-04-24 11:46:02 +0300, Konstantin Knizhnik wrote:
So what I am thinking about now is implicit query caching. If the same query with different literal values is repeated many times, we can try to generalize this query and replace it with a prepared query with parameters.
That's actually not all that easy:
- You pretty much have to do parse analysis to be able to do an accurate match. How much overhead is parse analysis vs. planning in your cases?
- The invalidation infrastructure for this, if not tied to fully parse-analyzed statements, is going to be hell.
- Migrating to parameters can actually cause significant slowdowns, not nice if that happens implicitly.
Well, first of all I want to share the results I have already obtained: pgbench with default parameters, scale 10 and one connection:

protocol               TPS
simple                 3492
extended               2927
prepared               6865
simple + autoprepare   6844

If this is string mashing on the unparsed query, as it appears to be, it's going to be a perennial source of security issues.

Sorry, maybe I have missed something, but I cannot understand how security can be violated by extracting string literals from a query. I am just copying bytes from one buffer to another; I do not try to interpret these parameters in any way. What I am doing is very similar to standard prepared statements. And moreover, the query is parsed! Only a query which was already parsed and executed (but with different values of its parameters) can be autoprepared.

Best, David.
-- Konstantin Knizhnik Postgres Professional: http://www.postgrespro.com The Russian Postgres Company
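To illustrate the point about not interpreting the copied bytes: once a literal has been turned into a parameter, its contents (including quote characters) travel as a bound value rather than as SQL text. A sketch with a hypothetical table, written as the equivalent explicit prepared statement:

CREATE TABLE users(name text);
-- Query text as sent by the client:
--   SELECT * FROM users WHERE name = 'O''Brien';
-- After generalization this becomes, conceptually:
PREPARE users_by_name(text) AS SELECT * FROM users WHERE name = $1;
EXECUTE users_by_name('O''Brien');  -- the parameter value is the string O'Brien;
                                    -- it is never re-parsed as part of the SQL text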
Re: [HACKERS] Cached plans and statement generalization
On 04/25/2017 08:09 PM, Serge Rielau wrote:
On Tue, Apr 25, 2017 at 9:45 AM, Konstantin Knizhnik wrote:
On 25.04.2017 19:12, Serge Rielau wrote:
On Apr 25, 2017, at 8:11 AM, Konstantin Knizhnik <k.knizh...@postgrespro.ru> wrote:
Another problem is caused by using integer literals in contexts where parameters cannot be used, for example "order by 1".
You will also need to deal with modifiers in types such as VARCHAR(10). Not sure if there are specific functions which can only deal with literals (?) as well.
Sorry, I do not completely understand how the presence of type modifiers can affect string literals used in a query. Can you provide me with an example?
SELECT 'hello'::CHAR(10) || 'World', 5 + 6;
You can substitute 'hello', 'World', 5, and 6. But not 10.
I am substituting only string literals. So the query above will be transformed to
SELECT $1::CHAR(10) || $2, 5 + 6;
What's wrong with that?
Also some OLAP syntax like "rows preceding". It pretty much boils down to whether you can do some shallow parsing rather than expending the effort to build the parse tree.
Cheers Serge
-- Konstantin Knizhnik Postgres Professional: http://www.postgrespro.com The Russian Postgres Company
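As a concrete version of Serge's example: written as an explicit prepared statement, every literal operand can become a parameter, but the typmod inside CHAR(10) must stay a literal because the grammar only accepts an integer constant there (the parameter types below are my assumption):

PREPARE gen(text, text, int, int) AS SELECT $1::CHAR(10) || $2, $3 + $4;
EXECUTE gen('hello', 'World', 5, 6);
-- Trying to parameterize the type modifier does not even parse:
--   SELECT 'hello'::CHAR($1);   -- syntax error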
Re: [HACKERS] Cached plans and statement generalization
On 04/25/2017 11:40 PM, Serge Rielau wrote:
On Apr 25, 2017, at 1:37 PM, Konstantin Knizhnik <k.knizh...@postgrespro.ru> wrote:
SELECT 'hello'::CHAR(10) || 'World', 5 + 6;
You can substitute 'hello', 'World', 5, and 6. But not 10.
I am substituting only string literals. So the query above will be transformed to SELECT $1::CHAR(10) || $2, 5 + 6; What's wrong with that?
Oh, well that leaves a lot of opportunities on the table, doesn't it?
Well, actually my primary intention was not to make badly designed programs (which do not use prepared statements) work faster. I wanted to address the cases where it is not possible to use prepared statements at all. If we want to substitute as many literals as possible with parameters, then parsing the query and deparsing the transformed parse tree seems to be the only reasonable approach. I will try to implement it as well, just to estimate the parsing overhead.
Cheers Serge
-- Konstantin Knizhnik Postgres Professional: http://www.postgrespro.com The Russian Postgres Company
Re: [HACKERS] Cached plans and statement generalization
On 26.04.2017 00:47, Andres Freund wrote:
On 2017-04-25 21:11:08, Doug Doole wrote:
When I did this in DB2, I didn't use the parser - it was too expensive. I just tokenized the statement and used some simple rules to bypass the invalid cases. For example, if I saw the tokens "ORDER" and "BY", then I'd disallow replacement until I hit the end of the current subquery or statement.
How did you manage plan invalidation and such?
The same mechanism as for prepared statements is used: cached plans are linked into the list by the SaveCachedPlan function and are invalidated by PlanCacheRelCallback.
- Andres
-- Konstantin Knizhnik Postgres Professional: http://www.postgrespro.com The Russian Postgres Company
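The invalidation path mentioned here can be observed from plain SQL with an explicit prepared statement: DDL on a referenced table triggers a relcache invalidation, the saved plan is marked stale, and the next EXECUTE replans (or complains if the result shape changed). A small sketch with a made-up table name:

CREATE TABLE inval_demo(a int);
PREPARE p AS SELECT * FROM inval_demo;
EXECUTE p;                               -- plan is built and cached
ALTER TABLE inval_demo ADD COLUMN b int; -- sends an invalidation for inval_demo
EXECUTE p;                               -- the cached plan is rebuilt; because the
                                         -- result row type changed, this reports
                                         -- "cached plan must not change result type"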
Re: [HACKERS] Cached plans and statement generalization
On 26.04.2017 01:34, Andres Freund wrote:
Hi,
(FWIW, on this list we don't do top-quotes)
On 2017-04-25 22:21:22, Doug Doole wrote:
Plan invalidation was no different than for any SQL statement. DB2 keeps a list of the objects the statement depends on. If any of the objects changes in an incompatible way, the plan is invalidated and kicked out of the cache. I suspect what is more interesting is plan lookup. DB2 has something called the "compilation environment". This is a collection of everything that impacts how a statement is compiled (SQL path, optimization level, etc.). Plan lookup is done using both the statement text and the compilation environment. So, for example, if my path is DOUG, MYTEAM, SYSIBM and your path is ANDRES, MYTEAM, SYSIBM, we will have different compilation environments. If we both issue "SELECT * FROM T" we'll end up with different cache entries even if T in both of our statements resolves to MYTEAM.T. If I execute "SELECT * FROM T", change my SQL path and then execute "SELECT * FROM T" again, I have a new compilation environment, so the second invocation of the statement will create a new entry in the cache. The first entry is not kicked out - it will still be there for re-use if I change my SQL path back to my original value (modulo LRU for cache memory management, of course).
It's not always that simple, at least in postgres, unless you disregard search_path. Consider e.g. cases like:

CREATE SCHEMA a;
CREATE SCHEMA b;
CREATE TABLE a.foobar(somecol int);
SET search_path = 'b,a';
SELECT * FROM foobar;
CREATE TABLE b.foobar(anothercol int);
SELECT * FROM foobar; -- may not be the cached plan from before!

It sounds - my memory of DB2 is very faint, and I never used it much - like similar issues could arise in DB2 too?

There is the same problem with explicitly prepared statements, isn't there? Certainly, when prepared statements are used explicitly it is the programmer's responsibility to avoid such collisions, while with autoprepare it is hidden from the programmer. But there is a GUC variable controlling the autoprepare feature, and by default it is switched off. So if a programmer or DBA enables it, they should take the effects of such a decision into account. By the way, isn't it a bug in PostgreSQL that altering the search path does not invalidate cached plans? As I already mentioned, the same problem can be reproduced with explicitly prepared statements.

Greetings, Andres Freund
-- Konstantin Knizhnik Postgres Professional: http://www.postgrespro.com The Russian Postgres Company
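Konstantin's claim that explicit prepared statements behave the same way is easy to try; here is a sketch of the experiment (what the last EXECUTE does is exactly the question under discussion, so the comments only state what is certain: creating b.foobar sends no invalidation for a plan that depends only on a.foobar):

CREATE SCHEMA a;
CREATE SCHEMA b;
CREATE TABLE a.foobar(somecol int);
SET search_path = b, a;
PREPARE q AS SELECT * FROM foobar;
EXECUTE q;                            -- resolves to a.foobar
CREATE TABLE b.foobar(anothercol int);
EXECUTE q;                            -- b.foobar now shadows a.foobar in the path,
                                      -- but nothing invalidated the cached plan,
                                      -- which still references a.foobar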
Re: [HACKERS] Cached plans and statement generalization
On 26.04.2017 04:00, Tsunakawa, Takayuki wrote:
From: pgsql-hackers-ow...@postgresql.org [mailto:pgsql-hackers-ow...@postgresql.org] On Behalf Of Konstantin Knizhnik
Well, first of all I want to share the results I have already obtained: pgbench with default parameters, scale 10 and one connection: So autoprepare is as efficient as explicit prepare and can increase performance almost two times.
This sounds great. BTW, when are the autoprepared statements destroyed?
Right now they are destroyed only when an invalidation message is received (when the catalog is changed). Prepared statements are local to a backend and are located in the backend's memory. It is unlikely that there will be so many different queries that they cause memory overflow, but in theory such a situation is certainly possible.
Are you considering some upper limit on the number of prepared statements?
In this case we need some kind of LRU for maintaining the cache of autoprepared statements. I think it is a good idea to have such a limited cache: it can avoid the memory overflow problem. I will try to implement it.
Regards Takayuki Tsunakawa
-- Konstantin Knizhnik Postgres Professional: http://www.postgrespro.com The Russian Postgres Company
Re: [HACKERS] Cached plans and statement generalization
On 26.04.2017 10:49, Konstantin Knizhnik wrote:
On 26.04.2017 04:00, Tsunakawa, Takayuki wrote:
Are you considering some upper limit on the number of prepared statements?
In this case we need some kind of LRU for maintaining the cache of autoprepared statements. I think it is a good idea to have such a limited cache: it can avoid the memory overflow problem. I will try to implement it.

I attach a new patch which allows limiting the number of autoprepared statements (the autoprepare_limit GUC variable). I also did more measurements, now with several concurrent connections and read-only statements. Results of pgbench with 10 connections, scale 10 and read-only statements are below:

Protocol               TPS
extended               87k
prepared               209k
simple + autoprepare   206k

As you can see, autoprepare provides more than a twofold speed improvement.

I also tried to measure the overhead of parsing (needed to be able to substitute all literals, not only string literals). I just added an extra call of pg_parse_query. Speed is reduced to 181k TPS. So the overhead is noticeable, but it still leaves such an optimization useful. This is why I want to ask the question: is it better to implement a slower but safer and more universal solution?

-- Konstantin Knizhnik Postgres Professional: http://www.postgrespro.com The Russian Postgres Company

diff --git a/src/backend/tcop/postgres.c b/src/backend/tcop/postgres.c index f6be98b..0c9abfc 100644 --- a/src/backend/tcop/postgres.c +++ b/src/backend/tcop/postgres.c @@ -188,6 +188,7 @@ static bool IsTransactionStmtList(List *parseTrees); static void drop_unnamed_stmt(void); static void SigHupHandler(SIGNAL_ARGS); static void log_disconnections(int code, Datum arg); +static bool exec_cached_query(const char *query_string); /* @@ -916,6 +917,14 @@ exec_simple_query(const char *query_string) drop_unnamed_stmt(); /* + * Try to find cached plan + */ + if (autoprepare_threshold != 0 && exec_cached_query(query_string)) + { + return; + } + + /* * Switch to appropriate context for constructing parsetrees. */ oldcontext = MemoryContextSwitchTo(MessageContext); @@ -4500,3 +4509,606 @@ log_disconnections(int code, Datum arg) port->user_name, port->database_name, port->remote_host, port->remote_port[0] ? " port=" : "", port->remote_port))); } + +typedef struct { + char const* query; + dlist_node lru; + int64 exec_count; + CachedPlanSource* plan; + int n_params; + int16 format; + bool disable_autoprepare; +} plan_cache_entry; + +/* + * Replace string literals with parameters. We do not consider integer or real literals to avoid problems with + * negative number, user defined operators, ...
For example it is not easy to distinguish cases (-1), (1-1), (1-1)-1 + */ +static void generalize_statement(const char *query_string, char** gen_query, char** query_params, int* n_params) +{ + size_t query_len = strlen(query_string); + char const* src = query_string; + char* dst; + char* params; + unsigned char ch; + + *n_params = 0; + + *gen_query = (char*)palloc(query_len*2); /* assume that we have less than 1000 parameters, the worst case is replacing '' with $999 */ + *query_params = (char*)palloc(query_len + 1); + dst = *gen_query; + params = *query_params; + + while ((ch = *src++) != '\0') { + if (isspace(ch)) { + /* Replace sequence of whitespaces with just one space */ + while (*src && isspace(*(unsigned char*)src)) { +src += 1; + } + *dst++ = ' '; + } else if (ch == '\'') { + while (true) { +ch = *src++; +if (ch == '\'') { + if (*src != '\'') { + break; + } else { + /* escaped quote */ + *params++ = '\''; + src += 1; + } +} else { + *params++ = ch; +} + } + *params++ = '\0'; + dst += sprintf(dst, "$%d", ++*n_params); + } else { + *dst++ = ch; + } + } + Assert(dst <= *gen_query + query_len); + Assert(params <= *query_params + query_len*2); + *dst = '\0'; +} + +static uint32 plan_cache_hash_fn(const void *key, Size keysize) +{ + return string_hash(((plan_cache_entry*)key)->query, 0); +} + +static int plan_cache_match_fn(const void *key1, const void *key2, Size keysize) +{ + return strcmp(((plan_cache_entry*)key1)->query, ((plan_cache_entry*)key2)->query); +} + +static void* plan_cache_keycopy_fn(void *dest, const void *src, Size keysize) +{ + ((plan_cache_entry*)dest)->query = pstrdup(((plan_cache_entry*)src)->query); +return dest; +} + +#define PLAN_CACHE_SIZE 113 + +size_t nPlanCacheHits; +size_t nPlanCacheMisses; + +/* + * Try to generalize query, find cached plan for it and execute + */ +static bool exec_cached_query(const char *query_string) +{ + Co
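For completeness, this is roughly how the feature in the attached patch would be switched on from a session, using the GUC names that appear in the patches in this thread (autoprepare_threshold and autoprepare_limit); the exact semantics and the values shown are assumptions based on the surrounding discussion, not documented behaviour:

-- Assumed semantics: a non-zero autoprepare_threshold enables the cached-query
-- path in exec_simple_query(), and autoprepare_limit caps the per-backend cache
-- with LRU eviction.
SET autoprepare_threshold = 1;
SET autoprepare_limit = 1000;
-- After that, repeated simple-protocol queries that differ only in literals,
-- such as pgbench's
--   SELECT abalance FROM pgbench_accounts WHERE aid = 42;
-- become candidates for transparent generalization and plan caching.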
Re: [HACKERS] Cached plans and statement generalization
On 04/26/2017 08:08 PM, Doug Doole wrote:
A naive option would be to invalidate anything that depends on table or view *.FOOBAR. You could probably make it a bit smarter by also requiring that schema A appear in the path.
This has been rumbling around in my head. I wonder if you could solve this problem by registering dependencies on objects which don't yet exist. Consider:

CREATE TABLE C.T1(...);
CREATE TABLE C.T2(...);
SET search_path='A,B,C,D';
SELECT * FROM C.T1, T2;

For T1, you'd register a hard dependency on C.T1 and no virtual dependencies, since the table is explicitly qualified. For T2, you'd register a hard dependency on C.T2, since that is the table that was selected for the query. You'd also register virtual dependencies on A.T2 and B.T2, since if either of those tables (or views) is created you need to recompile the statement. (Note that no virtual dependency is created on D.T2, since that table would never be selected by the compiler.) The catch is that virtual dependencies would have to be recorded and searched as strings, not OIDs, since the objects don't exist. Virtual dependencies only have to be checked during CREATE processing though, so that might not be too bad. But this is getting off topic - I just wanted to capture the idea while it was rumbling around.

I think it will be enough to handle modification of the search path and invalidate the prepared statement cache in this case.

-- Konstantin Knizhnik Postgres Professional: http://www.postgrespro.com The Russian Postgres Company
[HACKERS] Bug in prepared statement cache invalidation?
Hi hackers,

I have found out that Postgres correctly invalidates prepared plans which directly depend on an altered relation, but doesn't invalidate plans having transitive (indirect) dependencies. Is it a bug or a feature?

postgres=# create table foo(x integer);
CREATE TABLE
postgres=# select * from foo;
 x
---
(0 rows)
postgres=# create function returnqueryf() returns setof foo as $$ begin return query select * from foo; end; $$ language plpgsql;
CREATE FUNCTION
postgres=# select * from returnqueryf();
 x
---
(0 rows)
postgres=# create function returnqueryff() returns setof foo as $$ begin return query select * from returnqueryf(); end; $$ language plpgsql;
CREATE FUNCTION
postgres=# select * from returnqueryff();
 x
---
(0 rows)
postgres=# alter table foo add column y integer;
ALTER TABLE
postgres=# select * from foo;
 x | y
---+---
(0 rows)
postgres=# select * from returnqueryf();
 x | y
---+---
(0 rows)
postgres=# select * from returnqueryff();
ERROR: structure of query does not match function result type
DETAIL: Number of returned columns (1) does not match expected column count (2).
CONTEXT: PL/pgSQL function returnqueryff() line 1 at RETURN QUERY

-- Konstantin Knizhnik Postgres Professional: http://www.postgrespro.com The Russian Postgres Company
Re: [HACKERS] Cached plans and statement generalization
On 26.04.2017 13:46, Pavel Stehule wrote:
I attach a new patch which allows limiting the number of autoprepared statements (the autoprepare_limit GUC variable). I also did more measurements, now with several concurrent connections and read-only statements. Results of pgbench with 10 connections, scale 10 and read-only statements are below:

Protocol               TPS
extended               87k
prepared               209k
simple + autoprepare   206k

As you can see, autoprepare provides more than a twofold speed improvement. I also tried to measure the overhead of parsing (needed to be able to substitute all literals, not only string literals). I just added an extra call of pg_parse_query. Speed is reduced to 181k TPS. So the overhead is noticeable, but it still leaves such an optimization useful. This is why I want to ask the question: is it better to implement a slower but safer and more universal solution?
An unsafe solution does not make any sense, and it is dangerous (80% of database users do not have the necessary knowledge). If somebody needs the maximum possible performance, then he can use explicit prepared statements.

I attached a new patch to this mail. I have completely reimplemented my original approach and now use parse tree transformation. The new pgbench (-S -c 10) results are the following:

Protocol               TPS
extended               87k
prepared               209k
simple + autoprepare   185k

So there is some slowdown compared with my original implementation and with explicitly prepared statements, but it still provides more than a twofold speed-up compared with unprepared queries. And it doesn't require changing existing applications. Since most real production applications work with the DBMS through some connection pool (pgbouncer, ...), I think such an optimization will be useful. Isn't it interesting that we can increase system throughput almost two times just by setting one parameter in the configuration file?

I also tried to enable autoprepare by default and run the regression tests. 7 tests fail for the following reasons:
1. Slightly different error reporting (for example, the error location is not always reported identically).
2. Differences in query behavior caused by changed local settings (Andres gave an example with search_path, and the date test fails because of a changed datestyle).
3. Problems with indirect dependencies (when a table is altered, only cached plans directly depending on this relation are invalidated, but not plans with indirect dependencies).
4. Domain checks not being performed for null values.
I do not think these issues can cause problems for real applications.

Also it is possible to limit the number of autoprepared statements using the autoprepare_limit parameter, avoiding possible backend memory overflow in case of a large number of unique queries sent by an application. LRU discipline is used to drop the least recently used plans.

Any comments and suggestions for future improvement of this patch are welcome.

-- Konstantin Knizhnik Postgres Professional: http://www.postgrespro.com The Russian Postgres Company

diff --git a/src/backend/nodes/nodeFuncs.c b/src/backend/nodes/nodeFuncs.c index cd39167..4fbc8b7 100644 --- a/src/backend/nodes/nodeFuncs.c +++ b/src/backend/nodes/nodeFuncs.c @@ -3610,6 +3610,454 @@ raw_expression_tree_walker(Node *node, } /* + * raw_expression_tree_mutator --- transform raw parse tree. + * + * This function is implementing slightly different approach for tree update than expression_tree_mutator(). + * Callback is given pointer to pointer to the current node and can update this field instead of returning reference to new node.
+ * It makes it possible to remember changes and easily revert them without extra traversal of the tree. + * + * This function do not need QTW_DONT_COPY_QUERY flag: it never implicitly copy tree nodes, doing in-place update. + * + * Like raw_expression_tree_walker, there is no special rule about query + * boundaries: we descend to everything that's possibly interesting. + * + * Currently, the node type coverage here extends only to DML statements + * (SELECT/INSERT/UPDATE/DELETE) and nodes that can appear in them, because + * this is used mainly during analysis of CTEs, and only DML statements can + * appear in CTEs. If some other node is visited, iteration is immediately stopped and true is returned. + */ +bool +raw_expression_tree_mutator(Node *node, + bool (*mutator) (), + void *context) +{ + ListCell *temp; + + /* + * The walker has already visited the current node, and so we need only + * recurse into any sub-nodes it has. + */ + if (node == NULL) + return false; + + /* Guard against stack overflow due to overly complex expressions */ + check_stack_depth(); + + switch (nodeTag(node)) + { + case T_SetToDefault: + case T_CurrentOfExpr: + case T_Integer: + case T_Float: + case T_String: + case T_BitString: + case T_Null: + case T_ParamRef: + case T_A