Thanks for the new patch. I've reviewed the skip scan patch, but haven't taken 
a close look at the uniquekeys patch yet.


In my previous review I mentioned that queries of the form:

select distinct on(a) a,b from a where a=1;

do not lead to a skip scan with the patch, even though the skip scan would be 
much faster. It's about this piece of code in planner.c:


/* Consider index skip scan as well */
if (enable_indexskipscan &&
    IsA(path, IndexPath) &&
    ((IndexPath *) path)->indexinfo->amcanskip &&
    root->distinct_pathkeys != NIL)

The root->distinct_pathkeys is already filtered for redundant keys, so column 
'a' is not in there anymore. Still, it'd be much faster to use the skip scan 
here, because a regular scan will scan all entries with a=1, even though we're 
really only interested in the first one. In previous versions this could be 
fixed by changing the check in planner.c to use root->uniq_distinct_pathkeys 
instead of root->distinct_pathkeys, but things change a bit now that the patch 
is rebased on the unique-keys patch. Would it be valid to change this check to 
root->query_uniquekeys != NIL, so that skip scans are also considered for this 
query?
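
For illustration, the check might then look something like this (just a 
sketch; I'm assuming the uniquekeys patch exposes the unique keys on 
PlannerInfo as root->query_uniquekeys):

/* Consider index skip scan as well (sketch: tests query_uniquekeys
 * from the uniquekeys patch instead of the filtered distinct_pathkeys) */
if (enable_indexskipscan &&
    IsA(path, IndexPath) &&
    ((IndexPath *) path)->indexinfo->amcanskip &&
    root->query_uniquekeys != NIL)

That way the redundant-key filtering on distinct_pathkeys would no longer 
hide the DISTINCT ON column from the skip scan check.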

Second is about the use of _bt_skip and _bt_read_closest in nbtsearch.c. I 
don't think _bt_read_closest is correctly implemented, and I'm not sure it can 
be used at all, given the concerns Tom and Peter raised about such an 
approach. I had a similar idea to only partially read items from a page for 
another use case, for which I submitted a patch last Friday; both Tom and 
Peter find this idea quite scary [1]. You could take a look at my patch on 
that thread to see how to read a page partially in a correct way (well, 
correct as far as I can see so far...). Otherwise, perhaps we need to just use 
the regular _bt_readpage function, which reads everything. That is unfortunate 
from a performance point of view, since most of the time we're really only 
interested in the first tuple.
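
Falling back to the regular reader could look roughly like this (just a 
sketch; _bt_readpage and _bt_steppage exist in core nbtsearch.c, but the 
exact call site inside _bt_skip is an assumption on my part):

/* Sketch: after _bt_skip() has positioned on the target page, read the
 * whole page with the regular reader instead of _bt_read_closest(). */
if (!_bt_readpage(scan, dir, offnum))
{
    /* Nothing on this page matched; step to the next page. */
    if (!_bt_steppage(scan, dir))
        return false;
}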

E.g., it looks like there's a mixup between index offsets and heap TIDs in 
_bt_read_closest:
/*
 * Save the current item and the previous, even if the
 * latter does not pass scan key conditions
 */
ItemPointerData tid = prevItup->t_tid;
OffsetNumber prevOffnum = ItemPointerGetOffsetNumber(&tid);

_bt_saveitem(so, itemIndex, prevOffnum, prevItup);
itemIndex++;

_bt_saveitem(so, itemIndex, offnum, itup);
itemIndex++;

The 'prevOffnum' here is the offset number taken from the heap TID, not the 
tuple's offset within the index page, so it looks like an essentially random 
item is saved.
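
For illustration, saving the previous item would have to use its on-page 
offset instead (a sketch only; it assumes prevItup came from the immediately 
preceding line pointer, in which case OffsetNumberPrev gives its offset):

/* Use the previous tuple's index page offset; the offset part of its
 * heap TID points into a heap page and is unrelated to this page. */
OffsetNumber prevOffnum = OffsetNumberPrev(offnum);

_bt_saveitem(so, itemIndex, prevOffnum, prevItup);
itemIndex++;

_bt_saveitem(so, itemIndex, offnum, itup);
itemIndex++;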
Furthermore, index page offsets may change due to insertions and vacuums, so 
if we release the lock at any point, these offsets are not necessarily valid 
anymore. On top of that, the patch currently reads just the closest items and 
then never considers that page again, so if the first tuple we skipped to 
turns out to be not visible, we get a wrong result. Consider the following 
SQL to illustrate:

create table a (a int, b int, c int);
insert into a (select vs, ks, 10 from generate_series(1,5) vs, 
generate_series(1, 10000) ks);
create index on a (a,b);
analyze a;
select distinct on (a) a,b from a order by a,b;

 a | b
---+---
 1 | 1
 2 | 1
 3 | 1
 4 | 1
 5 | 1
(5 rows)

delete from a where a=2 and b=1;
DELETE 1

select distinct on (a) a,b from a order by a,b;

 a |  b
---+-----
 1 |   1
 2 | 249
 3 |   1
 4 |   1
 5 |   1
(5 rows)

The a=2 row should return b=2, because we deleted the tuple with a=2 and b=1; 
instead, the patch no longer considers any tuples from that page and returns 
the first tuple from the next page.


-Floris


[1] 
https://www.postgresql.org/message-id/flat/26641.1564778586%40sss.pgh.pa.us#dd8f23e1704f45447185894e1c2a4f2a
