On 28/11/2023 14:17, Thomas Munro wrote:
On Thu, Sep 28, 2023 at 7:33 AM Heikki Linnakangas <hlinn...@iki.fi> wrote:
in streaming_read.h:
typedef bool (*PgStreamingReadBufferDetermineNextCB) (PgStreamingRead *pgsr,
                                                      uintptr_t pgsr_private,
                                                      void *per_io_private,
                                                      BufferManagerRelation *bmr,
                                                      ForkNumber *forkNum,
                                                      BlockNumber *blockNum,
                                                      ReadBufferMode *mode);
I was surprised that 'bmr', 'forkNum' and 'mode' are given separately on
each read. I see that you used that in the WAL replay prefetching, so I
guess that makes sense.
In this version I have introduced an alternative simple callback.
It's approximately what we had already tried out in an earlier version
before I started streamifying recovery, but now you can choose between
the two, so recovery can opt for the wider callback.
Ok. Having two APIs is a bit redundant, but since most callers would
prefer the simpler one, that's probably a good tradeoff.
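For illustration, the narrower variant could look something like this;
the name and exact shape here are guesses rather than what the patch
necessarily does:

/* Hypothetical simpler callback: the relation, fork and mode are fixed
 * when the stream is constructed, so each call only has to produce the
 * next block number (or InvalidBlockNumber to end the stream). */
typedef BlockNumber (*PgStreamingReadDetermineNextSimpleCB) (PgStreamingRead *pgsr,
                                                             uintptr_t pgsr_private,
                                                             void *per_io_private);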
I've added some ramp-up logic. The idea is that after we streamify
everything in sight, we don't want to penalise users that don't really
need more than one or two blocks, but don't know that yet. Here is
how the system calls look when you do pg_prewarm():
pread64(32, ..., 8192, 0) = 8192            <-- start with just one block
pread64(32, ..., 16384, 8192) = 16384
pread64(32, ..., 32768, 24576) = 32768
pread64(32, ..., 65536, 57344) = 65536
pread64(32, ..., 131072, 122880) = 131072   <-- soon reading 16 blocks at a time
pread64(32, ..., 131072, 253952) = 131072
pread64(32, ..., 131072, 385024) = 131072
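(For illustration, here is a self-contained sketch of the doubling
behaviour visible in that trace; the constants and structure are
assumptions read off the trace, not the actual streaming_read.c code.)

#include <stdio.h>

#define BLCKSZ 8192
#define MAX_IO_BLOCKS 16        /* assumed cap of 16 blocks (128kB) per I/O */

int
main(void)
{
    int     distance = 1;       /* start with a single block */
    long    offset = 0;

    for (int i = 0; i < 8; i++)
    {
        /* reproduces the sizes and offsets seen in the trace above */
        printf("pread(fd, ..., %d, %ld)\n", distance * BLCKSZ, offset);
        offset += (long) distance * BLCKSZ;
        distance *= 2;          /* double the I/O size on every read... */
        if (distance > MAX_IO_BLOCKS)
            distance = MAX_IO_BLOCKS;   /* ...up to the largest I/O size */
    }
    return 0;
}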
I guess it could be done in quite a few different ways and I'm open to
better ideas. This way introduces some prefetching stalls at the start,
but it ramps up quickly and is soon out of the way. I wonder if we
would want to make that a policy that a caller can disable, for cases
where you want to skip the ramp-up and go straight for the largest
possible I/O size? Then I think we'd need a 'flags' argument to the
streaming read constructor functions.
I think a 'flags' argument and a way to opt-out of the slow start would
make sense. pg_prewarm in particular knows that it will read the whole
relation.
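For concreteness, the opt-out might look something like this; the flag
names are hypothetical, not taken from the patch:

/* Hypothetical flag bits for the streaming read constructor. */
#define PGSR_FLAG_DEFAULT 0x00  /* ramp up from single-block I/Os */
#define PGSR_FLAG_FULL    0x01  /* caller promises to read the whole
                                 * relation: skip the ramp-up and go
                                 * straight to maximum-sized I/Os */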
A small detour: While contemplating how this interacts with parallel
sequential scan, which also has a notion of ramping up, I noticed
another problem. One parallel seq scan process does this:
fadvise64(32, 35127296, 131072, POSIX_FADV_WILLNEED) = 0
preadv(32, [...], 2, 35127296) = 131072
preadv(32, [...], 2, 35258368) = 131072
fadvise64(32, 36175872, 131072, POSIX_FADV_WILLNEED) = 0
preadv(32, [...], 2, 36175872) = 131072
preadv(32, [...], 2, 36306944) = 131072
...
We don't really want those fadvise() calls. We don't get them with
parallelism disabled, because streaming_read.c is careful not to
generate advice for sequential workloads. That is based on ancient
wisdom from this mailing list, re-confirmed on recent Linux: WILLNEED
hints actually get in the way of Linux's own prefetching and slow you
down, so we only want them for truly random access. But the logic
can't see that another process is making holes in this process's
sequence.
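(Roughly, the heuristic amounts to something like the following
simplified sketch, not the actual streaming_read.c code. With parallel
workers, each process sees holes in its own block sequence, so the test
misclassifies a globally sequential scan as random.)

#include <fcntl.h>

/* Simplified sketch: only advise the kernel when the requested block
 * does not directly follow the previous one, i.e. it looks random. */
static void
maybe_advise(int fd, long prev_blocknum, long blocknum)
{
    if (blocknum != prev_blocknum + 1)
        (void) posix_fadvise(fd, blocknum * 8192, 8192,
                             POSIX_FADV_WILLNEED);
}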
Hmm, aside from making the sequential pattern invisible to this process,
are we defeating Linux's logic too, just by performing the reads from
multiple processes? The processes might issue the reads to the kernel
out of order.
How bad is the slowdown when you issue WILLNEED hints on sequential access?
The two obvious solutions are (1) pass in a flag at the start saying
"I promise this is sequential even if it doesn't look like it, no
hints please" and (2) invent "shared" (cross-process) streaming
reads, and teach all the parallel seq scan processes to get their
buffers from there.
Idea (2) is interesting to think about, but even if it is a useful idea
(not sure), it is certainly overkill just to solve this little problem
for now. So perhaps I should implement (1), which would be another
reason to add a flags argument. It's not a perfect solution, though,
because some more 'data driven' parallel scans (indexes, bitmaps, ...)
have a similar problem that is less amenable to top-down kludgery.
(1) seems fine to me.
--
Heikki Linnakangas
Neon (https://neon.tech)