On Wed, Jan 14, 2015 at 9:00 PM, Jim Nasby <jim.na...@bluetreble.com> wrote:
> Simply doing something like blkno % num_workers is going to cause
> imbalances,
Yes.

> but trying to do this on a per-block basis seems like too much
> overhead.

...but no.  Or at least, I doubt it.  The cost of handing out blocks
one at a time is that, for each block, a worker's got to grab a
spinlock, increment and record the block number counter, and release
the spinlock.  Or, use an atomic add.  Now, it's true that spinlock
cycles and atomic ops can sometimes impose severe overhead, but you
have to look at it as a percentage of the overall work being done.  In
this case, the backend has to read, pin, and lock the page and process
every tuple on the page.  Processing every tuple on the page may
involve de-TOASTing the tuple (leading to many more page accesses), or
evaluating a complex expression, or hitting CLOG to check visibility,
but even if it doesn't, I think the amount of work that it takes to
process all the tuples on the page will be far larger than the cost of
one atomic increment operation per block.

As mentioned downthread, a far bigger consideration is the I/O pattern
we create.  A sequential scan is so-called because it reads the
relation sequentially.  If we destroy that property, we will be more
than slightly sad.  It might be OK to do sequential scans of, say,
each 1GB segment separately, but I'm pretty sure it would be a real
bad idea to read 8kB at a time at blocks 0, 64, 128, 1, 65, 129, ...

What I'm thinking about is that we might have something like this:

struct this_lives_in_dynamic_shared_memory
{
    BlockNumber last_block;
    Size prefetch_distance;
    Size prefetch_increment;
    slock_t mutex;
    BlockNumber next_prefetch_block;
    BlockNumber next_scan_block;
};

Each worker takes the mutex and checks whether next_prefetch_block -
next_scan_block < prefetch_distance and also whether
next_prefetch_block < last_block.  If both are true, it prefetches
some number of additional blocks, as specified by prefetch_increment.
Otherwise, it increments next_scan_block and scans the block
corresponding to the old value.
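To make the either-or step concrete, here's a rough single-threaded
sketch of what a worker would do each time it takes the mutex.  It is
not actual PostgreSQL code: the spinlock is omitted, claim_next_block,
ParallelScanState, and the uint32_t/uint64_t stand-ins for BlockNumber
and Size are my names, and the prefetch itself (which real code would
issue via something like PrefetchBuffer) is just simulated by advancing
the counter.

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>

/* Stand-ins for PostgreSQL's BlockNumber and Size (assumed, not real). */
typedef uint32_t BlockNumber;
#define InvalidBlockNumber ((BlockNumber) 0xFFFFFFFF)

typedef struct
{
    BlockNumber last_block;         /* one past the final block to scan */
    uint64_t    prefetch_distance;  /* how far prefetch may run ahead */
    uint64_t    prefetch_increment; /* blocks prefetched per call */
    /* slock_t mutex omitted; caller is assumed to hold it */
    BlockNumber next_prefetch_block;
    BlockNumber next_scan_block;
} ParallelScanState;

/*
 * One pass of the worker's loop, mutex held.  If prefetch has fallen
 * within prefetch_distance of the scan and blocks remain, advance the
 * prefetch pointer by up to prefetch_increment and return
 * InvalidBlockNumber so the caller retries.  Otherwise claim the next
 * block to scan, or return InvalidBlockNumber with next_scan_block ==
 * last_block when the scan is complete.
 */
static BlockNumber
claim_next_block(ParallelScanState *ps)
{
    if (ps->next_prefetch_block - ps->next_scan_block < ps->prefetch_distance &&
        ps->next_prefetch_block < ps->last_block)
    {
        uint64_t stop = (uint64_t) ps->next_prefetch_block +
            ps->prefetch_increment;

        while (ps->next_prefetch_block < stop &&
               ps->next_prefetch_block < ps->last_block)
            ps->next_prefetch_block++;  /* real code would prefetch here */
        return InvalidBlockNumber;      /* prefetched; caller retries */
    }

    if (ps->next_scan_block >= ps->last_block)
        return InvalidBlockNumber;      /* scan complete */

    return ps->next_scan_block++;       /* scan the old value */
}
```

A worker would simply loop on claim_next_block until the scan is
exhausted; because prefetching and claiming happen under one mutex,
the blocks still come out in strictly sequential order, with the
prefetch pointer running at most prefetch_distance ahead.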
So in this way, the prefetching runs ahead of the scan by a
configurable amount (prefetch_distance), which should be chosen so
that the prefetches have time to complete before the scan actually
reaches those blocks.  Right now, of course, we rely on the operating
system to prefetch for sequential scans, but I have a strong hunch
that may not work on all systems if there are multiple processes doing
the reads.

Now, what of other strategies like dividing up the relation into 1GB
chunks and reading each one in a separate process?  We could certainly
DO that, but what advantage does it have over this?  The only benefit
I can see is that you avoid accessing a data structure of the type
shown above for every block, but that only matters if that cost is
material, and I tend to think it won't be.  On the flip side, it means
that the granularity for dividing up work between processes is now
very coarse - when there are less than 6GB of data left in a relation,
at most 6 processes can work on it.  That might be OK if the data is
being read in from disk anyway, but it's certainly not the best we can
do when the data is in memory.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers