On Fri, Nov 13, 2015 at 10:56 AM, Robert Haas <robertmh...@gmail.com> wrote:
> On Thu, Nov 12, 2015 at 10:39 PM, Amit Kapila <amit.kapil...@gmail.com> wrote:
> > The number of shared buffers hit could be different across different runs
> > because the read sequence of parallel workers can't be guaranteed, also
> > I don't think same is even guaranteed for Seq Scan node,
>
> The number of hits could be different.  However, it seems like any
> sequential scan, parallel or not, should have a number of accesses
> (hit + read) equal to the size of the relation.  Not sure if that's
> what is happening here.
>
After the patch provided above to fix the issue reported by Pavel, that is the
behaviour, but I think there are a few more things which we might want to
consider; refer to the below plan:

Total pages in table
--------------------------------
postgres=# select relname,relpages from pg_class where relname like '%t2%';
 relname | relpages
---------+----------
 t2      |     5406
(1 row)

Parallel Plan
-----------------------------
postgres=# explain (analyze,buffers,timing) select count(*) from t2 where c1 % 10 = 0;
                                                          QUERY PLAN
------------------------------------------------------------------------------------------------------------------------------
 Aggregate  (cost=8174.90..8174.91 rows=1 width=0) (actual time=1055.294..1055.294 rows=1 loops=1)
   Buffers: shared hit=446 read=5054
   ->  Gather  (cost=0.00..8162.40 rows=5000 width=0) (actual time=79.787..959.651 rows=100000 loops=1)
         Number of Workers: 2
         Buffers: shared hit=446 read=5054
         ->  Parallel Seq Scan on t2  (cost=0.00..8162.40 rows=5000 width=0) (actual time=30.771..2518.844 rows=100000 loops=1)
               Filter: ((c1 % 10) = 0)
               Rows Removed by Filter: 900000
               Buffers: shared hit=352 read=5054
 Planning time: 0.170 ms
 Execution time: 1059.400 ms
(11 rows)

Let's focus on Buffers and actual time in the above plan.

Buffers - The Parallel Seq Scan node shows a total of 5406 (352 + 5054)
buffers, which tallies with what is expected.  However, the Gather node shows
5500 (446 + 5054).  The reason is that we accumulate the overall buffer usage
of each parallel worker, which includes the buffers touched during node
startup as well (refer ParallelQueryMain()), and that startup usage also gets
counted towards the buffer calculation of the Gather node.  The theory behind
collecting overall buffer usage for parallel execution was that we need it for
pg_stat_statements, where the stats are accumulated for the overall execution
rather than on a node-by-node basis (refer to the queryDesc->totaltime usage
in standard_ExecutorRun()).
I think here we need to decide what is the right value to display at the
Gather node:

1. Display the same number of buffers at the Gather node as at the Parallel
   Seq Scan node.
2. Display the number of buffers at the Parallel Seq Scan node plus the
   additional buffers used by parallel workers for miscellaneous work like
   ExecutorStart(), etc.
3. Don't account for the buffers used by parallel workers.
4. Anything better?

Also, in conjunction with the above, we need to decide what should be
accounted for in pg_stat_statements.

actual_time -
Actual time at Gather node: actual time = 79.787..959.651
Actual time at Parallel Seq Scan node = 30.771..2518.844

The time at the Parallel Seq Scan node is more than the time at the Gather
node because the time taken by the parallel workers is also accumulated into
the Parallel Seq Scan node, whereas some of it doesn't get accounted for at
the Gather node.  This could be confusing for users, because the time
displayed at the Parallel Seq Scan node will be equal to:

time_taken_by_worker-1 + time_taken_by_worker-2 + ...

This sum can exceed the actual time taken by the query, because the workers
execute in parallel but we have accounted for the time as if each one were
executing serially.  I think the time for fetching the tuples from workers is
already accounted for at the Gather node, so maybe for the Parallel Seq Scan
node we can omit adding the time for each of the parallel workers.

Thoughts?

With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com