> Try the attached patch -- it has DEBUG1 traces with the contents of
> index tuples from key points during index scans, allowing you to see
> what's going on both as a B-Tree is descended, and as a range scan is
> performed. It also shows details of the insertion scankey that is set
> up within _bt_first(). This hasn't been adopted to the patch at all,
> so you'll probably need to do that.

Thanks! That works quite nicely.

I've pinpointed the problem to within _bt_skip. I'll try to illustrate with my 
test case. The data in the table is (a,b)=(1,1), (1,2) ... (1,10000), (2, 1), 
(2,2), ... (2,10000) until (5,10000).
Running the query
SELECT DISTINCT ON (a) a,b FROM a WHERE b=2;
The flow is like this:
_bt_first is called first - it sees there are no suitable scan keys to start at 
a custom location in the tree, so it just starts from the beginning and 
searches until it finds the first tuple (1,2).
After the first tuple was yielded, _bt_skip kicks in. It constructs an insert 
scan key with a=1 and nextkey=true, so doing the _bt_search + _bt_binsrch on 
this, it finds the first tuple larger than this: (2,1). This is not the tuple 
that it stops at though, because after that it does this:

if (ScanDirectionIsForward(dir))
/* Move back for _bt_next */
offnum = OffsetNumberPrev(offnum);
....
/* Now read the data */
    if (!_bt_readpage(scan, dir, offnum))
{
/*
* There's no actually-matching data on this page.  Try to advance to
* the next page.  Return false if there's no matching data at all.
*/
LockBuffer(so->currPos.buf, BUFFER_LOCK_UNLOCK);
if (!_bt_steppage(scan, dir))

First, it takes the previous tuple with OffsetNumberPrev (so tuple before 
(2,1), which is (1,10000)). This is done, because if this tuple were to be 
returned, there would be a call to _bt_next afterwards, which would then 
conveniently be on the tuple (2,1) that we want. However, _bt_readpage messes 
things up, because it only reads tuples that match all the provided keys (so 
where b=2). The only tuple it'll return is (2,2). This will be the tuple that 
is set, however, on the call to _bt_next, the tuple is first incremented, so 
we'll find (2,3) there which doesn't match our keys. This leads it to skip 
(2,2) in our result set.

I was wondering about something else: don't we also have another problem with 
updating this current index tuple by skipping before calling 
btgettuple/_bt_next? I see there's some code in btgettuple to kill dead tuples 
when scan->kill_prior_tuple is true. I'm not too familiar with the concept of 
killing dead tuples while doing index scans, but by looking at the code it 
seems to be possible that btgettuple returns a tuple, caller processes it and 
sets kill_prior_tuple to true in order to have it killed. However, then the 
skip scan kicks in, which sets the current tuple to a completely different 
tuple. Then, on the next call of btgettuple, the wrong tuple gets killed. Is my 
reasoning correct here or am I missing something?

-Floris


Reply via email to