I wrote: > Petr Jelinek <p...@2ndquadrant.com> writes: >> I can only get the issue when the sort order of the individual keys does >> not correlate and the operator sorts according to the first column and >> there are duplicate values for the first column.
> Yeah, I think the combination of ASC and DESC columns may be necessary to > break things. It needs closer analysis. Okay, I've now studied the patch more carefully and understood that it wasn't quite so naive as to not consider multi-column cases at all. It does work for simple scalar multi-column cases, because the test on SK_BT_MATCHED is placed so that it only skips checking the first column's condition, not the conditions on lower-order columns. Rather, the problem is failure to consider ROW() comparisons, which do not necessarily have the same simple behavior as scalar comparisons. They sort of accidentally work, *if* the lower-order columns in the ROW() comparison match the index's lower-order columns as to position and index direction. Tobias' example fails because the second column in the ROW() isn't sorted the same, and you could also make it fail with a 3-or-more-column index in which the ROW() comparison tested, say, just the first and third columns. What I think we should do is move the SK_BT_MATCHED flag down to the first sub-key of the ROW() comparison, which is the one that does behave the same as in scalar cases. The fact that it doesn't fail for the case where the lower-order columns of the ROW() do match the index shows that this still leaves something on the table: within a ROW() comparison, you could set SK_BT_MATCHED on more than just the first sub-key, *if* the subsequent columns match the index. This makes me think that the analysis should be getting done in or near _bt_mark_scankey_required, not in the rather random place it is now, because _bt_mark_scankey_required already understands about sub-keys matching the index or not. However ... there is an even larger problem, which is that the patch thinks it can use skey->sk_flags for what's effectively transient state storage --- not merely transient, but scan-direction-dependent. This means that a cursor that reads forwards for awhile and then turns around and reads backwards will get the wrong answer, even without anything as complicated as multiple key columns. It's a bit harder to exhibit such a bug than you might guess, because of btree's habit of prefetching an index page at a time; you have to make the scan cross an index page boundary before you turn around and back it up. Bearing that in mind, I was soon able to devise a failure case in the regression database: regression=# set enable_seqscan TO 0; SET regression=# set enable_bitmapscan TO 0; SET regression=# begin; BEGIN regression=# declare c cursor for select unique1,unique2 from tenk1 where unique1 > 9500; DECLARE CURSOR regression=# fetch 20 from c; unique1 | unique2 ---------+--------- 9501 | 5916 9502 | 1812 9503 | 1144 9504 | 9202 9505 | 4300 9506 | 5551 9507 | 847 9508 | 4093 9509 | 9418 9510 | 1862 9511 | 848 9512 | 4823 9513 | 1125 9514 | 9276 9515 | 1160 9516 | 5177 9517 | 3600 9518 | 9677 9519 | 5518 9520 | 1429 (20 rows) regression=# fetch backward 30 from c; unique1 | unique2 ---------+--------- 9519 | 5518 9518 | 9677 9517 | 3600 9516 | 5177 9515 | 1160 9514 | 9276 9513 | 1125 9512 | 4823 9511 | 848 9510 | 1862 9509 | 9418 9508 | 4093 9507 | 847 9506 | 5551 9505 | 4300 9504 | 9202 9503 | 1144 9502 | 1812 9501 | 5916 9500 | 3676 9499 | 927 9498 | 2807 9497 | 6558 9496 | 1857 9495 | 1974 9494 | 560 9493 | 3531 9492 | 9875 9491 | 7237 9490 | 8928 (30 rows) Ooops. I believe the way to fix this would be to stop regarding SK_BT_MATCHED as state, and instead treat it as a scankey property identified during _bt_preprocess_keys, analogously to SK_BT_REQFWD/SK_BT_REQBKWD --- and, like those, you'd need two flags not one since the properties will be determined independently of knowing which direction you'll be going in. In any event, I am now of the opinion that this patch needs to be reverted outright and returned to the authors for redesign. There are too many things wrong with it and too little reason to think that we have to have it in 9.5. regards, tom lane -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers