Useless memoize path generated for unique join on primary keys

2022-05-03 Thread Benjamin Coutu
Hello,

I have come across a plan that should never get generated IMHO:

SELECT 1
FROM extdataregular e1
INNER JOIN extdataempty e2 ON e1.field = e2.field AND e1.index = e2.index

generates the following plan:

Nested Loop  (cost=1.13..528540.89 rows=607604 width=4) (actual time=9298.504..9298.506 rows=0 loops=1)
  ->  Index Only Scan using pk_extdataempty on extdataempty e2  (cost=0.56..157969.52 rows=4078988 width=16) (actual time=0.026..641.248 rows=4067215 loops=1)
        Heap Fetches: 268828
  ->  Memoize  (cost=0.58..0.67 rows=1 width=16) (actual time=0.002..0.002 rows=0 loops=4067215)
        Cache Key: e2.field, e2.index
        Cache Mode: logical
        Hits: 0  Misses: 4067215  Evictions: 3228355  Overflows: 0  Memory Usage: 65537kB
        Buffers: shared hit=16268863
        ->  Index Only Scan using pk_extdataregular on extdataregular e1  (cost=0.57..0.66 rows=1 width=16) (actual time=0.001..0.001 rows=0 loops=4067215)
              Index Cond: ((field = e2.field) AND (index = e2.index))
              Heap Fetches: 2

Please note that the memoize node has no cache hits, which is not surprising 
given that we are joining on two primary keys that are unique by definition 
("field" and "index" make up the primary key of both tables).
Why would it ever make sense to generate a memoize plan for a unique join?

I think this issue might tie in with the current discussion over on the hackers
mailing list [1].

Cheers, Ben

[1] 
https://www.postgresql.org/message-id/flat/CAApHDvpFsSJAThNLtqaWvA7axQd-VOFct%3DFYQN5muJV-sYtXjw%40mail.gmail.com

-- 

Benjamin Coutu
ben.co...@zeyos.com

ZeyOS GmbH & Co. KG
http://www.zeyos.com




Re: Useless memoize path generated for unique join on primary keys

2022-05-03 Thread David Rowley
On Tue, 3 May 2022 at 23:05, Benjamin Coutu  wrote:
>   ->  Memoize  (cost=0.58..0.67 rows=1 width=16) (actual time=0.002..0.002 rows=0 loops=4067215)
>         Cache Key: e2.field, e2.index
>         Cache Mode: logical
>         Hits: 0  Misses: 4067215  Evictions: 3228355  Overflows: 0  Memory Usage: 65537kB
>         Buffers: shared hit=16268863
>         ->  Index Only Scan using pk_extdataregular on extdataregular e1  (cost=0.57..0.66 rows=1 width=16) (actual time=0.001..0.001 rows=0 loops=4067215)
>               Index Cond: ((field = e2.field) AND (index = e2.index))

> Why would it ever make sense to generate a memoize plan for a unique join?

It wouldn't ever make sense.

The problem is that estimate_num_groups() is used to estimate the
number of distinct values and that function does not know about
primary keys. There's no way the costing of Memoize would allow a
Memoize plan to be used if it thought all values were unique, so the
only possibility here is that ndistinct is being underestimated by
some amount that makes Memoize look like the most favourable plan.

You could see what the planner thinks about the ndistinct estimate on
field, index by doing:

EXPLAIN SELECT field,index FROM extdataregular GROUP BY 1,2;

Whatever you see in the final row estimate for that plan is what's
being fed into the Memoize costing code.
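
If you want to compare that estimate against reality, a plain count of the
distinct pairs works too (just a sketch; it scans the whole table, so it may
take a while on a large table):

SELECT count(*)
FROM (SELECT DISTINCT field, index FROM extdataregular) AS d;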

> I think this issue might tie in with the current discussion over on the 
> hackers mailing list [1]

I'd say it's a pretty different problem. The cache hit ratio
discussion on that thread talks about underestimating the hit ratio.
That particular problem could only lead to Memoize plans *not* being
chosen when they maybe should be. Not the other way around, which is
your case.

create statistics extdataregular_field_index_stats (ndistinct) on
field, index from extdataregular;
analyze extdataregular;

would likely put that right.
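
Once that's analyzed, you can sanity-check what the planner now knows about
the combination (assuming PostgreSQL 12 or later, where the pg_stats_ext
view is available):

SELECT statistics_name, n_distinct
FROM pg_stats_ext
WHERE tablename = 'extdataregular';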

David




Re: Useless memoize path generated for unique join on primary keys

2022-05-03 Thread Benjamin Coutu
> I'd say it's a pretty different problem. The cache hit ratio
> discussion on that thread talks about underestimating the hit ratio.
> That particular problem could only lead to Memoize plans *not* being
> chosen when they maybe should be. Not the other way around, which is
> your case.
> 
> create statistics extdataregular_field_index_stats (ndistinct) on
> field, index from extdataregular;
> analyze extdataregular;
> 
> would likely put that right.

Thanks David, using extended statistics for both (and only for both) tables 
solved this problem.

BTW, thank you for all your work on performance in recent releases.




Re: Useless memoize path generated for unique join on primary keys

2022-05-03 Thread David Rowley
On Wed, 4 May 2022 at 00:21, Benjamin Coutu  wrote:
> Thanks David, using extended statistics for both (and only for both) tables 
> solved this problem.

Oh, whoops. I did get that backwards.  The estimate used by the
Memoize costing code is from the outer side of the join, which is
extdataempty in this case. I don't think
extdataregular_field_index_stats will do anything. It'll be the ones
you added on extdataempty that are making it work.
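
In other words, the statistics object that actually matters here should be
the one on the outer table, something along these lines (named by analogy
with the earlier suggestion):

create statistics extdataempty_field_index_stats (ndistinct) on
field, index from extdataempty;
analyze extdataempty;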

> BTW, thank you for all your work on performance in recent releases.

Thanks for the feedback :)

David




Window partial fetch optimization

2022-05-03 Thread Levi Aul
I have a “temporal table” — a table where there are multiple “versions” of
entities, with each version having a distinct timestamp:

CREATE TABLE contract_balance_updates (
    block_id bigint NOT NULL,
    block_signed_at timestamp(0) without time zone NOT NULL,
    contract_address bytea NOT NULL,
    holder_address bytea NOT NULL,
    start_block_height bigint NOT NULL,
    balance numeric NOT NULL
) PARTITION BY RANGE (block_signed_at);

-- one for each partition (applied by pg_partman from a template)
CREATE UNIQUE INDEX contract_balance_updates_pkey
ON contract_balance_updates(
    holder_address bytea_ops,
    contract_address bytea_ops,
    start_block_height int8_ops DESC
);


This table has ~1 billion rows; millions of entities (i.e. (holder_address,
contract_address) pairs); and for a few entities (power-law distribution),
there are millions of versions (i.e. rows with distinct start_block_height
values.)

The main query this table needs to support is to efficiently get the
newest version-rows of each contract_address for a given holder_address, as
of a given application-domain time. (Think of it as: what the set of
entities owned by a user looked like at a given time.) The “as of” part is
important here: it’s why we can’t just use the usual system-temporal setup
with separate “historical” and “current version” tables. (Also, due to our
source being effectively an event store, and due to our throughput
requirements [~100k new records per second], we must discover+insert new
entity-version rows concurrently + out-of-order; so it’d be pretty
non-trivial to keep anything like a validity-upper-bound column updated
using triggers.)

It is our expectation that this query “should” be able to be
cheap-to-compute and effectively instantaneous. (It’s clear to us how we
would make it so, given a simple LMDB-like sorted key-value store:
prefix-match on holder_address; take the first row you find for the
contract-address you’re on; build a comparator key of (holder_address,
contract_address, highest-possible-version) and traverse to find the lowest
row that sorts greater than it; repeat.)

Which might, in SQL, be expressed as something like this:

WITH ranked_holder_balances AS (
    SELECT
        *,
        row_number() OVER w AS balance_rank
    FROM contract_balance_updates
    WHERE holder_address = '\x'::bytea
    WINDOW w AS (
        PARTITION BY holder_address, contract_address
        ORDER BY start_block_height DESC
    )
    ORDER BY holder_address ASC, contract_address ASC, start_block_height DESC
)
SELECT *
FROM ranked_holder_balances
WHERE balance_rank = 1


The trouble is that this query seems to traverse the tuples (or maybe just
the index nodes?) of every row in the matched partitions. We know that the
query only “needs” to touch the first row of each window partition
(row_number() = 1); but Postgres seemingly isn’t aware of this potential
optimization. So the query is fast when all matched entities have few
versions; but when any matched entities have millions of versions, the cold
performance of the query becomes extremely bad.

Subquery Scan on ranked_holder_balances  (cost=5.02..621761.05 rows=2554 width=55) (actual time=270.031..82148.370 rows=856 loops=1)
  Filter: (ranked_holder_balances.balance_rank = 1)
  Rows Removed by Filter: 554167
  Buffers: shared hit=166647 read=391704 dirtied=65
  ->  WindowAgg  (cost=5.02..605150.30 rows=510707 width=81) (actual time=270.028..82098.501 rows=555023 loops=1)
        Buffers: shared hit=166647 read=391704 dirtied=65
        ->  Merge Append  (cost=5.02..584722.02 rows=510707 width=65) (actual time=270.017..81562.693 rows=555023 loops=1)
              Sort Key: contract_balance_updates.contract_address, contract_balance_updates.start_block_height DESC
              Buffers: shared hit=166647 read=391704 dirtied=65
              ->  Index Scan using contract_balance_updates_pkey_p2015 on contract_balance_updates_p2015 contract_balance_updates_1  (cost=0.28..2.51 rows=1 width=65) (actual time=0.013..0.014 rows=0 loops=1)
                    Index Cond: (holder_address = '\xe03c23519e18d64f144d2800e30e81b0065c48b5'::bytea)
                    Buffers: shared hit=2
              ->  Index Scan using contract_balance_updates_pkey_p2016 on contract_balance_updates_p2016 contract_balance_updates_2  (cost=0.42..8.34 rows=6 width=65) (actual time=0.010..0.011 rows=0 loops=1)
                    Index Cond: (holder_address = '\xe03c23519e18d64f144d2800e30e81b0065c48b5'::bytea)
                    Buffers: shared hit=3
              ->  Index Scan using contract_balance_updates_pkey_p2017 on contract_balance_updates_p2017 contract_balance_updates_3  (cost=0.56..44599.76 rows=40460 width=65) (actual time=269.891..6690.808 rows=41677 loops=1)
                    Index Cond: (holder_address = '\xe03c23519e18d64f144d2800e30e81b0065c48b5'::bytea)
                    Buffers: shared hit=11596 read=30025

Re: Window partial fetch optimization

2022-05-03 Thread David Rowley
On Wed, 4 May 2022 at 06:11, Levi Aul  wrote:
> It is our expectation that this query “should” be able to be cheap-to-compute 
> and effectively instantaneous. (It’s clear to us how we would make it so, 
> given a simple LMDB-like sorted key-value store: prefix-match on 
> holder_address; take the first row you find for the contract-address you’re 
> on; build a comparator key of (holder_address, contract_address, 
> highest-possible-version) and traverse to find the lowest row that sorts 
> greater than it; repeat.)
>
> Which might, in SQL, be expressed as something like this:
>
> WITH ranked_holder_balances AS (
> SELECT
> *,
> row_number() OVER w AS balance_rank
> FROM contract_balance_updates
> WHERE holder_address = '\x'::bytea
> WINDOW w AS (
> PARTITION BY holder_address, contract_address
> ORDER BY start_block_height DESC
> )
> ORDER BY holder_address ASC, contract_address ASC, start_block_height DESC
> )
> SELECT *
> FROM ranked_holder_balances
> WHERE balance_rank = 1

Yes, PostgreSQL 14 is not very smart about realising that WHERE
balance_rank = 1 is only going to match the first row of each window
partition. PostgreSQL 15 (coming later this year) should be better in
this regard as some work was done to teach the query planner about
monotonic window functions [1].  However, that change likely does not
do all you'd like here as the WindowAgg node still must consume and
throw away all tuples until it finds the first tuple belonging to the
next window partition.  It sounds like you really want "Skip Scans" or
"Loose Index Scans", which are implemented by some other RDBMSs.  I
imagine that even with the change in PostgreSQL 15, it still wouldn't
be as fast as your DISTINCT ON example.
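
For anyone following along, a DISTINCT ON query of that general shape over
this table would look roughly like the following (a sketch only, not
necessarily the exact query being referred to): it returns the newest row
per contract_address for one holder.

SELECT DISTINCT ON (contract_address)
    holder_address, contract_address, start_block_height, balance
FROM contract_balance_updates
WHERE holder_address = '\xe03c23519e18d64f144d2800e30e81b0065c48b5'::bytea
ORDER BY contract_address ASC, start_block_height DESC;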

> WITH bup1 AS (
> SELECT DISTINCT bup.holder_address, bup.contract_address
> FROM contract_balance_updates bup
> WHERE bup.holder_address = 
> '\xe03c23519e18d64f144d2800e30e81b0065c48b5'::bytea
> ORDER BY bup.contract_address ASC
> )
> SELECT
>   bup1.holder_address,
>   bup1.contract_address,
>   (
> SELECT balance
> FROM contract_balance_updates bup2
> WHERE bup2.holder_address = bup1.holder_address
> AND bup2.contract_address = bup1.contract_address
> ORDER BY bup2.holder_address ASC, bup2.contract_address ASC, 
> bup2.start_block_height DESC
> LIMIT 1
>   ) AS balance
> FROM bup1

> I really don’t like this last approach; it scans twice, it’s surprising / 
> confusing for people maintaining the query, etc. I believe that, due to the 
> correlated subquery, the planning overhead is also O(N) with the number of 
> matched entities increases (though I don’t have a good test-case for this.)

No, the subquery is not replanned each time it is rescanned. It's
planned once and that same plan will be executed each time. So no O(n)
planning overhead.

> Is there any way to get PG to do what this last query is doing, purely using 
> window-functions / distinct on / etc.? Because, if there is, I can’t find it.
>
> It seems that PG can in fact do index-range-seeking (since that’s what it’s 
> doing when gathering the distinct contract_addresses in the last query.) It 
> seems intuitive to me that it should be using such an approach to filter for 
> rows in window/group-partitions, when a criteria+index that can be combined 
> to limit the size of the window/group are available to the planner. And that, 
> even when not able to be automatically inferred, it would make sense for 
> there to be control over such behaviour in SQL, using hypothetical syntax 
> like:

Unfortunately, DISTINCT can only be implemented with Hash Aggregate or
Sort / Index Scan + Unique.  We don't have anything currently which
will jump to the next highest key in an index.  There has been some
work on what we're starting to call "Skip Scans", but it's all still
work in progress.

You might find something useful in [2] which might help speed up your
DISTINCT query.
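
The core trick on that wiki page is a recursive CTE that repeatedly jumps
to the next distinct key through the index. Roughly sketched for your case
(untested, and assuming the pkey index shown earlier):

WITH RECURSIVE distinct_contracts AS (
    (SELECT contract_address
     FROM contract_balance_updates
     WHERE holder_address = '\xe03c23519e18d64f144d2800e30e81b0065c48b5'::bytea
     ORDER BY contract_address ASC
     LIMIT 1)
    UNION ALL
    SELECT (SELECT bup.contract_address
            FROM contract_balance_updates bup
            WHERE bup.holder_address = '\xe03c23519e18d64f144d2800e30e81b0065c48b5'::bytea
              AND bup.contract_address > dc.contract_address
            ORDER BY bup.contract_address ASC
            LIMIT 1)
    FROM distinct_contracts dc
    WHERE dc.contract_address IS NOT NULL
)
SELECT contract_address
FROM distinct_contracts
WHERE contract_address IS NOT NULL;

Each recursion step does a single index probe to find the next distinct
contract_address, so the cost scales with the number of distinct contracts
rather than the number of version rows.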

> -- for windows
> row_number() OVER (PARTITION BY x ORDER BY x LIMIT 10 OFFSET 3)
>
> -- for groups
> GROUP BY x, y, z (APPLYING LIMIT 20 OFFSET 5 PER GROUP)
>
>
> Does this make sense? Or is this something PG is already doing, and I just 
> haven’t found the right magic words / built my index correctly to unlock it? 
> (I notice that the last example is an index-only scan; would I get this 
> behaviour from the previous two queries if I made the index a covering index 
> such that those could be index-only scans as well?)

Unfortunately, there are no magic words here. PostgreSQL 14 simply has
no ability to know that row_number() is monotonically increasing, and
therefore has no ability to skip any processing for rows that are
never needed.

David

[1] 
https://git.postgresql.org/gitweb/?p=postgresql.git;a=commit;h=9d9c02ccd1aea8e9131d8f4edb21bf1687e40782
[2] https://wiki.postgresql.org/wiki/Loose_indexscan