On Wed, Jan 14, 2015 at 9:00 PM, Jim Nasby <jim.na...@bluetreble.com> wrote:
> Simply doing something like blkno % num_workers is going to cause
> imbalances,
Yes.

> but trying to do this on a per-block basis seems like too much
> overhead.

...but no.  Or at least, I doubt it.  The cost of handing out blocks
one at a time is that, for each block, a worker's got to grab a
spinlock, increment and record the block number counter, and release
the spinlock.  Or, use an atomic add.  Now, it's true that spinlock
cycles and atomic ops can sometimes impose severe overhead, but you
have to look at it as a percentage of the overall work being done.  In
this case, the backend has to read, pin, and lock the page and process
every tuple on the page.  Processing every tuple on the page may
involve de-TOASTing the tuple (leading to many more page accesses), or
evaluating a complex expression, or hitting CLOG to check visibility,
but even if it doesn't, I think the amount of work that it takes to
process all the tuples on the page will be far larger than the cost of
one atomic increment operation per block.

As mentioned downthread, a far bigger consideration is the I/O pattern
we create.  A sequential scan is so-called because it reads the
relation sequentially.  If we destroy that property, we will be more
than slightly sad.  It might be OK to do sequential scans of, say,
each 1GB segment separately, but I'm pretty sure it would be a real
bad idea to read 8kB at a time at blocks 0, 64, 128, 1, 65, 129, ...

What I'm thinking about is that we might have something like this:

struct this_lives_in_dynamic_shared_memory
{
    BlockNumber last_block;
    Size prefetch_distance;
    Size prefetch_increment;
    slock_t mutex;
    BlockNumber next_prefetch_block;
    BlockNumber next_scan_block;
};

Each worker takes the mutex and checks whether next_prefetch_block -
next_scan_block < prefetch_distance and also whether
next_prefetch_block < last_block.  If both are true, it prefetches
some number of additional blocks, as specified by prefetch_increment.
Otherwise, it increments next_scan_block and scans the block
corresponding to the old value.
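To make the either-or step concrete, here's a rough single-threaded
sketch of what a worker would do each time it takes the mutex.  It is
not actual PostgreSQL code: the spinlock is omitted, claim_next_block,
ParallelScanState, and the uint32_t/uint64_t stand-ins for BlockNumber
and Size are my names, and the prefetch itself (which real code would
issue via something like PrefetchBuffer) is just simulated by advancing
the counter.

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>

/* Stand-ins for PostgreSQL's BlockNumber and Size (assumed, not real). */
typedef uint32_t BlockNumber;
#define InvalidBlockNumber ((BlockNumber) 0xFFFFFFFF)

typedef struct
{
    BlockNumber last_block;         /* one past the final block to scan */
    uint64_t    prefetch_distance;  /* how far prefetch may run ahead */
    uint64_t    prefetch_increment; /* blocks prefetched per call */
    /* slock_t mutex omitted; caller is assumed to hold it */
    BlockNumber next_prefetch_block;
    BlockNumber next_scan_block;
} ParallelScanState;

/*
 * One pass of the worker's loop, mutex held.  If prefetch has fallen
 * within prefetch_distance of the scan and blocks remain, advance the
 * prefetch pointer by up to prefetch_increment and return
 * InvalidBlockNumber so the caller retries.  Otherwise claim the next
 * block to scan, or return InvalidBlockNumber with next_scan_block ==
 * last_block when the scan is complete.
 */
static BlockNumber
claim_next_block(ParallelScanState *ps)
{
    if (ps->next_prefetch_block - ps->next_scan_block < ps->prefetch_distance &&
        ps->next_prefetch_block < ps->last_block)
    {
        uint64_t stop = (uint64_t) ps->next_prefetch_block +
            ps->prefetch_increment;

        while (ps->next_prefetch_block < stop &&
               ps->next_prefetch_block < ps->last_block)
            ps->next_prefetch_block++;  /* real code would prefetch here */
        return InvalidBlockNumber;      /* prefetched; caller retries */
    }

    if (ps->next_scan_block >= ps->last_block)
        return InvalidBlockNumber;      /* scan complete */

    return ps->next_scan_block++;       /* scan the old value */
}
```

A worker would simply loop on claim_next_block until the scan is
exhausted; because prefetching and claiming happen under one mutex,
the blocks still come out in strictly sequential order, with the
prefetch pointer running at most prefetch_distance ahead.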
So in this way, the prefetching runs ahead of the scan by a
configurable amount (prefetch_distance), which should be chosen so
that the prefetches have time to complete before the scan actually
reaches those blocks.  Right now, of course, we rely on the operating
system to prefetch for sequential scans, but I have a strong hunch
that may not work on all systems if there are multiple processes doing
the reads.

Now, what of other strategies like dividing up the relation into 1GB
chunks and reading each one in a separate process?  We could certainly
DO that, but what advantage does it have over this?  The only benefit
I can see is that you avoid accessing a data structure of the type
shown above for every block, but that only matters if that cost is
material, and I tend to think it won't be.  On the flip side, it means
that the granularity for dividing up work between processes is now
very coarse - when there are less than 6GB of data left in a relation,
at most 6 processes can work on it.  That might be OK if the data is
being read in from disk anyway, but it's certainly not the best we can
do when the data is in memory.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers