As per the discussion on another thread about using custom scan nodes to prototype parallel sequential scan, I have developed the same, but directly by adding new nodes for parallel sequential scan. There might be some advantages to developing this as a contrib module using custom scan nodes; however, I think we might get stuck at some point due to the limitations of custom scan nodes, as pointed out by Andres.
The basic idea is that while evaluating the cheapest path for a scan, the optimizer will also evaluate whether it can use a parallel seq scan path. Currently I have kept a very simple model to calculate the cost of a parallel seq scan path: divide the CPU and disk cost by the available number of worker backends. (We can enhance this based on further experiments and discussion; we also need to consider worker startup and dynamic shared memory setup cost.) The work, i.e. the scan of blocks, is divided equally among all workers, except for corner cases where the blocks can't be divided evenly, in which case the last worker is responsible for scanning the remaining blocks (a small standalone sketch of this division appears further below). The number of worker backends that can be used for parallel seq scan is configured by a new GUC, parallel_seqscan_degree, whose default value is zero, meaning parallel seq scan will not be considered unless the user sets this value.

In the ExecutorStart phase, we initiate the required number of workers as per the parallel seq scan plan, set up dynamic shared memory, and share the information the workers need to execute the scan. Currently I just share the relid, targetlist and number of blocks to be scanned by each worker; however, I think we might want to generate a plan for each worker in the master backend and then share it with the individual workers.

To fetch the data from the multiple queues corresponding to the workers, a simple mechanism is used: fetch from the first queue until all its data is consumed, then fetch from the second queue, and so on (also sketched further below). Here the master backend is responsible only for getting the data from the workers and passing it back to the client. I am sure we can improve this strategy in many ways, for example by making the master backend also scan some of the blocks rather than just collecting data from the workers, and by using a better strategy to fetch data from the multiple queues.

A worker backend receives the scan-related information from the master backend, generates a plan from it and executes that plan, so the work of scanning the data after generating the plan is very much like exec_simple_query() (i.e. create the portal and run it based on the planned statement), except that the worker backend initializes the block range it wants to scan in the executor initialization phase (ExecInitSeqScan()). Workers exit after sending their data to the master backend, which essentially means that for each execution we need to start the workers afresh. We could improve this by giving control of the workers to the postmaster so that we don't need to initialize them for each execution, but that can be a totally separate optimization, better done independently of this patch.

As we currently don't have a mechanism to share transaction state, I have used a separate transaction in the worker backends to execute the plan.

Any error in the master backend, whether reported by a worker or arising in the master backend itself, should terminate all the workers before aborting the transaction. We can't do this with the error context callback mechanism (error_context_stack) that we use at other places in the code, because here we need the callback from the time the workers are started until the execution is complete, and error_context_stack gets reset once control leaves the function that set it. One way could be to maintain the callback information in TransactionState and use it to kill the workers before aborting the transaction in the main backend. Another could be to have a separate variable similar to error_context_stack, used specifically for storing the workers' state, and kill the workers in errfinish via a callback (a rough standalone sketch of this idea follows below). Currently I have handled this at the time of detaching from shared memory. Another point that needs to be taken care of in the worker backend is that if any error occurs, we should *not* abort the transaction, as the transaction state is shared across all workers.
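To illustrate the errfinish-callback alternative just mentioned, here is a minimal standalone sketch; it is not code from the patch, and all names in it are made up for illustration. The idea is simply to keep a stack of cleanup callbacks, alive for as long as the workers are, that the error path can walk before the transaction is aborted:

/*
 * Standalone sketch (not from the patch): a stack, analogous to
 * error_context_stack, whose callbacks know how to terminate the
 * parallel workers; the error path would walk it before aborting
 * the transaction.  All names are illustrative.
 */
#include <stdio.h>

typedef struct WorkerCleanupItem
{
    void        (*callback)(void *arg);
    void       *arg;
    struct WorkerCleanupItem *next;
} WorkerCleanupItem;

/* head of the stack; lives for as long as the workers do */
static WorkerCleanupItem *worker_cleanup_stack = NULL;

static void
push_worker_cleanup(WorkerCleanupItem *item,
                    void (*callback)(void *), void *arg)
{
    item->callback = callback;
    item->arg = arg;
    item->next = worker_cleanup_stack;
    worker_cleanup_stack = item;
}

/* would be invoked from the error path, before the transaction is aborted */
static void
run_worker_cleanups(void)
{
    for (WorkerCleanupItem *item = worker_cleanup_stack; item; item = item->next)
        item->callback(item->arg);
    worker_cleanup_stack = NULL;
}

static void
terminate_workers(void *arg)
{
    printf("terminating %d workers\n", *(int *) arg);
}

int
main(void)
{
    WorkerCleanupItem item;
    int         nworkers = 4;

    push_worker_cleanup(&item, terminate_workers, &nworkers);
    run_worker_cleanups();      /* simulate hitting an error */
    return 0;
}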
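As a concrete illustration of the block division and the naive cost model described earlier, here is a small standalone sketch (again, not the patch's code; the function names and the numbers are illustrative). The last worker picks up the blocks left over by integer division, and the parallel cost is just the serial cost divided by the number of workers:

/*
 * Standalone sketch (not from the patch): divide nblocks among nworkers
 * as described above -- equal shares, with the last worker also taking
 * any remainder -- and divide the serial cost by the number of workers.
 */
#include <stdio.h>

typedef struct BlockRange
{
    unsigned int start;     /* first block this worker scans */
    unsigned int nblocks;   /* number of blocks this worker scans */
} BlockRange;

static BlockRange
worker_block_range(unsigned int total_blocks, int nworkers, int worker_id)
{
    BlockRange  range;
    unsigned int per_worker = total_blocks / nworkers;

    range.start = per_worker * worker_id;
    range.nblocks = per_worker;

    /* last worker also scans the blocks left over by integer division */
    if (worker_id == nworkers - 1)
        range.nblocks += total_blocks % nworkers;

    return range;
}

static double
parallel_scan_cost(double serial_cpu_cost, double serial_disk_cost, int nworkers)
{
    /* naive model from this mail: split CPU and disk cost evenly */
    return (serial_cpu_cost + serial_disk_cost) / nworkers;
}

int
main(void)
{
    int          nworkers = 4;
    unsigned int total_blocks = 101;    /* 101 blocks shows the remainder case */

    for (int i = 0; i < nworkers; i++)
    {
        BlockRange  r = worker_block_range(total_blocks, nworkers, i);

        printf("worker %d: blocks %u..%u\n", i, r.start, r.start + r.nblocks - 1);
    }
    printf("estimated parallel cost: %.2f\n",
           parallel_scan_cost(100.0, 1.0, nworkers));
    return 0;
}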
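And here is a standalone sketch of the simple fetch order described earlier, where the master drains one worker's queue completely before moving to the next. The WorkerQueue type and helper are stand-ins for illustration, not PostgreSQL's actual shared memory queue API:

/*
 * Standalone sketch (not from the patch): the master backend drains the
 * first worker's queue completely, then the second, and so on.
 */
#include <stdio.h>
#include <stdbool.h>

typedef struct WorkerQueue
{
    int     tuples[4];      /* stand-in for a per-worker shared-memory queue */
    int     ntuples;
    int     next;
} WorkerQueue;

static bool
queue_receive(WorkerQueue *q, int *tuple)
{
    if (q->next >= q->ntuples)
        return false;           /* this worker has sent everything */
    *tuple = q->tuples[q->next++];
    return true;
}

static void
drain_worker_queues(WorkerQueue *queues, int nworkers)
{
    for (int i = 0; i < nworkers; i++)
    {
        int     tuple;

        /* consume everything from queue i before moving to queue i+1 */
        while (queue_receive(&queues[i], &tuple))
            printf("got tuple %d from worker %d\n", tuple, i);
    }
}

int
main(void)
{
    WorkerQueue queues[2] = {
        {{1, 2, 3}, 3, 0},
        {{4, 5}, 2, 0}
    };

    drain_worker_queues(queues, 2);
    return 0;
}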
Currently, parallel seq scan will not be considered for statements other than SELECT, or if there is a join in the statement, or if the statement contains quals, or if the target list contains non-Var fields. We can definitely support simple quals and target lists containing more than plain Vars; by simple, I mean that they should not contain functions or other constructs that can't be pushed down to the worker backends.

Behaviour of some simple statements with the patch is as below:

postgres=# create table t1(c1 int, c2 char(500)) with (fillfactor=10);
CREATE TABLE
postgres=# insert into t1 values(generate_series(1,100),'amit');
INSERT 0 100
postgres=# explain select c1 from t1;
                      QUERY PLAN
------------------------------------------------------
 Seq Scan on t1  (cost=0.00..101.00 rows=100 width=4)
(1 row)

postgres=# set parallel_seqscan_degree=4;
SET
postgres=# explain select c1 from t1;
                          QUERY PLAN
--------------------------------------------------------------
 Parallel Seq Scan on t1  (cost=0.00..25.25 rows=100 width=4)
   Number of Workers: 4
   Number of Blocks Per Workers: 25
(3 rows)

postgres=# explain select Distinct(c1) from t1;
                             QUERY PLAN
--------------------------------------------------------------------
 HashAggregate  (cost=25.50..26.50 rows=100 width=4)
   Group Key: c1
   ->  Parallel Seq Scan on t1  (cost=0.00..25.25 rows=100 width=4)
         Number of Workers: 4
         Number of Blocks Per Workers: 25
(5 rows)

The attached patch is just to facilitate the discussion about parallel seq scan and maybe some other dependent tasks, like sharing of various states (combocid, snapshot) with parallel workers. It is by no means ready for any complex test; of course I will work towards making it more robust, both in terms of adding more stuff and doing performance optimizations.

Thoughts/Suggestions?

With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com
parallel_seqscan_v1.patch