Hi,

On 2025-03-12 07:35:46 +1300, Thomas Munro wrote:
> On Thu, Feb 27, 2025 at 11:20 PM Andres Freund <and...@anarazel.de> wrote:
> > On 2025-02-27 11:19:55 +1300, Thomas Munro wrote:
> > > On Wed, Feb 26, 2025 at 10:55 PM Andres Freund <and...@anarazel.de> wrote:
> > > > I was working on expanding tests for AIO and as part of that wrote
> > > > a test for temp tables -- our coverage is fairly awful, there were
> > > > many times during AIO development where I knew I had trivially
> > > > reachable temp table specific bugs but all tests passed.
> > > >
> > > > The test for that does trigger the problem described above and is
> > > > fixed by the patches in this thread (which I included in the other
> > > > thread):
> 
> Here is a subset of those patches again:
> 
> 1.  Per-backend buffer limit, take III.  Now the check is in
> read_stream_start_pending_read() so TOC == TOU.
> 
> Annoyingly, test cases like the one below still fail, despite
> following the rules.  The other streams eat all the buffers and then
> one gets an allowance of zero, but uses its right to take one pin
> anyway to make progress, and there isn't one.

I think that may be ok. If there are no unpinned buffers, it seems expected
that starting a new stream will fail.  That's not the same as what we did
previously, namely failing in a read stream that had started successfully,
because we issued large IOs even though only a small number of buffers were
unpinned.


> I wonder if we should use temp_buffers - 100?  Then leave the minimum GUC
> value at 100 still, so you have an easy way to test with 0, 1,
> ... additional buffers?

I think that just makes it harder to test the exhaustion scenario without
really fixing anything?


> 2.  It shouldn't give up issuing random advice immediately after a
> jump, or it could stall on (say) the second 128kB of a 256kB
> sequential chunk (ie the strace you showed on the BHS thread).  It
> only makes sense to assume kernel readahead takes over once you've
> actually *read* sequentially.  In practice this makes it a lot more
> aggressive about advice (like the BHS code in master): it only gives
> up if the whole look-ahead window is sequential.


> 3.  Change the distance algorithm to care only about hits and misses,
> not sequential heuristics.  It made at least some sense before, but it
> doesn't make sense for AIO, and even in synchronous mode it means that
> you hit random jumps with insufficient look-ahead, so I don't think we
> should keep it.
> 
> I also realised that the sequential heuristics are confused by that
> hidden trailing block thing, so in contrived pattern testing a
> hit-miss-hit-miss... pattern would be considered sequential, and even
> if you fix that (the forwarding patches above do), an exact
> hit-miss-hit-miss pattern also gets stuck between distances 1 and 2
> (double, decrement, double, ...).  Might be worth waiting a bit longer
> before decrementing, IDK.
> 
> I'll rebase the others and post soon.
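
FWIW, spelling that oscillation out (assuming, as I understand the current
algorithm, that a hit decrements the distance and a miss doubles it):

    distance 2 -> hit -> 1 -> miss -> 2 -> hit -> 1 -> miss -> 2 -> ...

so the look-ahead window never grows past 2, even though every other block
needs an IO.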



> +
> +/* see GetAdditionalPinLimit() */
> +uint32
> +GetAdditionalLocalPinLimit(void)
> +{
> +     Assert(NLocalPinnedBuffers <= num_temp_buffers);
> +     return num_temp_buffers - NLocalPinnedBuffers;
> +}

This doesn't behave quite the way GetAdditionalPinLimit() does: it can return
0. That makes some sense, since pinning an additional buffer will then always
fail. Perhaps worth calling out in a comment, though?
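
Perhaps something like this (comment wording is just a sketch):

    /*
     * Note that unlike GetAdditionalPinLimit(), this can return 0, in
     * which case pinning even one more local buffer will fail.
     */
    uint32
    GetAdditionalLocalPinLimit(void)
    {
        Assert(NLocalPinnedBuffers <= num_temp_buffers);
        return num_temp_buffers - NLocalPinnedBuffers;
    }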


>  static void
>  read_stream_look_ahead(ReadStream *stream, bool suppress_advice)
>  {
>       while (stream->ios_in_progress < stream->max_ios &&
> -                stream->pinned_buffers + stream->pending_read_nblocks < stream->distance)
> +                ((stream->pinned_buffers == 0 && stream->distance > 0) ||
> +                     stream->pinned_buffers + stream->pending_read_nblocks < stream->distance))

What does the new "stream->pinned_buffers == 0 && stream->distance > 0"
condition really mean? And when would it be true without the pre-existing
condition also being true?


>       {
>               BlockNumber blocknum;
>               int16           buffer_index;
>               void       *per_buffer_data;
>  
> -             if (stream->pending_read_nblocks == io_combine_limit)
> +             /* If we have a pending read that can't be extended, start it now. */
> +             Assert(stream->pinned_buffers + stream->pending_read_nblocks <=
> +                        stream->max_pinned_buffers);
> +             if (stream->pending_read_nblocks == io_combine_limit ||
> +                     (stream->pinned_buffers == 0 &&
> +                      stream->pending_read_nblocks == stream->max_pinned_buffers))
>               {
>                       read_stream_start_pending_read(stream, suppress_advice);
>                       suppress_advice = false;
> @@ -360,14 +409,15 @@ read_stream_look_ahead(ReadStream *stream, bool suppress_advice)
>               /* We have to start the pending read before we can build another. */
>               while (stream->pending_read_nblocks > 0)
>               {
> -                     read_stream_start_pending_read(stream, suppress_advice);
> -                     suppress_advice = false;
> -                     if (stream->ios_in_progress == stream->max_ios)
> +                     if (!read_stream_start_pending_read(stream, suppress_advice) ||
> +                             stream->ios_in_progress == stream->max_ios)
>                       {
> -                             /* And we've hit the limit.  Rewind, and stop here. */
> +                             /* And we've hit a buffer or I/O limit.  Rewind and wait. */
>                               read_stream_unget_block(stream, blocknum);
>                               return;
>                       }
> +
> +                     suppress_advice = false;
>               }

If read_stream_start_pending_read() returns false because we hit the pin
limit, does it really help to call read_stream_unget_block()? IIUC that'll
defer one block for later - but what about the other buffers in a multi-block
read?



> @@ -260,16 +261,30 @@ read_stream_start_pending_read(ReadStream *stream, bool suppress_advice)
>       else
>               Assert(stream->next_buffer_index == stream->oldest_buffer_index);
>  
> -     /*
> -      * If advice hasn't been suppressed, this system supports it, and this
> -      * isn't a strictly sequential pattern, then we'll issue advice.
> -      */
> -     if (!suppress_advice &&
> -             stream->advice_enabled &&
> -             stream->pending_read_blocknum != stream->seq_blocknum)
> +     /* Do we need to issue read-ahead advice? */
> +     flags = 0;
> +     if (stream->advice_enabled)
> +     {
>               flags = READ_BUFFERS_ISSUE_ADVICE;
> -     else
> -             flags = 0;
> +
> +             if (stream->pending_read_blocknum == stream->seq_blocknum)
> +             {
> +                     /*
> +                      * Suppress advice if our WaitReadBuffers() calls have caught up
> +                      * with the first advice we issued for this sequential run.
> +                      */
> +                     if (stream->seq_start == InvalidBlockNumber)
> +                             suppress_advice = true;
> +             }
> +             else
> +             {
> +                     /* Random jump, so start a new sequential run. */
> +                     stream->seq_start = stream->pending_read_blocknum;
> +             }
> +
> +             if (suppress_advice)
> +                     flags = 0;
> +     }

Seems a bit confusing to first set
  flags = READ_BUFFERS_ISSUE_ADVICE
and then later unset it again. Maybe just set it inside an
if (!suppress_advice)?
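
I.e. something like this (untested sketch, same logic as your patch, just
restructured):

    /* Do we need to issue read-ahead advice? */
    flags = 0;
    if (stream->advice_enabled)
    {
        if (stream->pending_read_blocknum == stream->seq_blocknum)
        {
            /*
             * Suppress advice if our WaitReadBuffers() calls have caught
             * up with the first advice we issued for this sequential run.
             */
            if (stream->seq_start == InvalidBlockNumber)
                suppress_advice = true;
        }
        else
        {
            /* Random jump, so start a new sequential run. */
            stream->seq_start = stream->pending_read_blocknum;
        }

        if (!suppress_advice)
            flags = READ_BUFFERS_ISSUE_ADVICE;
    }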



>        * Skip the initial ramp-up phase if the caller says we're going to be
> @@ -825,6 +842,15 @@ read_stream_next_buffer(ReadStream *stream, void **per_buffer_data)
>                       distance = stream->distance * 2;
>                       distance = Min(distance, stream->max_pinned_buffers);
>                       stream->distance = distance;
> +
> +                     /*
> +                      * If we've caught up with the first advice issued for the current
> +                      * sequential run, cancel further advice until the next random
> +                      * jump.  The kernel should be able to see the pattern now that
> +                      * we're issuing sequential preadv() calls.
> +                      */
> +                     if (stream->ios[io_index].op.blocknum == stream->seq_start)
> +                             stream->seq_start = InvalidBlockNumber;

So stream->seq_start doesn't really denote the start of the sequential run;
it denotes the point up to which the caller needs to process before we stop
issuing advice. Maybe add a comment to it and rename it to something like
->seq_until_processed?
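
E.g. something like (comment wording is just a sketch of my understanding):

    /*
     * Block number of the first advice issued for the current sequential
     * run.  Once WaitReadBuffers() has processed the IO starting here,
     * further advice is suppressed until the next random jump.
     * InvalidBlockNumber means advice is currently suppressed.
     */
    BlockNumber seq_until_processed;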

Other than this, the approach seems to make sense!

Greetings,

Andres Freund

