On 3/16/24 20:12, Andres Freund wrote:
> Hi,
> 
> On 2024-03-15 18:42:29 -0400, Melanie Plageman wrote:
>> On Fri, Mar 15, 2024 at 5:14 PM Andres Freund <and...@anarazel.de> wrote:
>>> On 2024-03-14 17:39:30 -0400, Melanie Plageman wrote:
>>> I spent a good amount of time looking into this with Melanie. After a bunch 
>>> of
>>> wrong paths I think I found the issue: We end up prefetching blocks we have
>>> already read. Notably this happens even as-is on master - just not as
>>> frequently as after moving BitmapAdjustPrefetchIterator().
>>>
>>> From what I can tell the prefetching in parallel bitmap heap scans is
>>> thoroughly broken.  I added some tracking of the last block read, the last
>>> block prefetched to ParallelBitmapHeapState and found that with a small
>>> effective_io_concurrency we end up with ~18% of prefetches being of blocks 
>>> we
>>> already read! After moving the BitmapAdjustPrefetchIterator() to rises to 
>>> 86%,
>>> no wonder it's slower...
>>>
>>> The race here seems fairly substantial - we're moving the two iterators
>>> independently from each other, in multiple processes, without useful 
>>> locking.
>>>
>>> I'm inclined to think this is a bug we ought to fix in the backbranches.
>>
>> Thinking about how to fix this, perhaps we could keep the current max
>> block number in the ParallelBitmapHeapState and then when prefetching,
>> workers could loop calling tbm_shared_iterate() until they've found a
>> block at least prefetch_pages ahead of the current block. They
>> wouldn't need to read the current max value from the parallel state on
>> each iteration. Even checking it once and storing that value in a
>> local variable prevented prefetching blocks after reading them in my
>> example repro of the issue.
> 
> That would address some of the worst behaviour, but it doesn't really seem to
> address the underlying problem of the two iterators being modified
> independently. ISTM the proper fix would be to protect the state of the
> iterators with a single lock, rather than pushing down the locking into the
> bitmap code. OTOH, we'll only need one lock going forward, so being economic
> in the effort of fixing this is also important.
> 

Can you share some details about how you identified the problem, counted
the prefetches that happen too late, etc? I'd like to try to reproduce
this to understand the issue better.

If I understand correctly, what may happen is that a worker reads blocks
from the "prefetch" iterator, but before it manages to issue the
posix_fadvise, some other worker already did pread. Or can the iterators
get "out of sync" in a more fundamental way?

If my understanding is correct, why would a single lock solve that? Yes,
we'd advance the iterators at the same time, but surely we'd not issue
the fadvise calls while holding the lock, and the prefetch/fadvise for a
particular block could still happen in different workers.

I suppose a dirty PoC fix should not be too difficult, and it'd allow us
to check if it works.


regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


Reply via email to