On Fri, Jan 2, 2015 at 9:04 AM, Amit Kapila <amit.kapil...@gmail.com> wrote:
> While working on the parallel seq-scan patch to adapt this framework, I
> noticed a few things and have questions regarding the same.
>
> 1.
> Currently the parallel worker just attaches to the error queue. For the
> tuple queue, do you expect that to be done in the same place, or in the
> caller-supplied function? If the latter, then we need the segment
> address as input to that function in order to attach the queue to the
> segment (shm_mq_attach()).
> Another question I have in this regard: if we have redirected messages
> to the error queue by using pq_redirect_to_shm_mq, then how can we set
> up a tuple queue for the same purpose? Similarly, I think more handling
> is needed for the tuple queue in the master backend, and the answer to
> the above will dictate the best way to do it.
I've come to the conclusion that it's a bad idea for tuples to be sent
through the same queue as errors. We want errors (or notices, but
especially errors) to be processed promptly, but there may be a
considerable delay in processing tuples. For example, imagine a plan
that looks like this:

    Nested Loop
      -> Parallel Seq Scan on p
      -> Index Scan on q
           Index Cond: q.x = p.x

The parallel workers should fill up the tuple queues used by the
parallel seq scan so that the master doesn't have to do any of that work
itself. Therefore, the normal situation will be that those tuple queues
are all full. If an error occurs in a worker at that point, it can't add
it to the tuple queue, because the tuple queue is full. But even if it
could do that, the master won't notice the error until it has read all
of the queued-up tuple messages sitting in the queue ahead of the error,
and maybe some messages from the other queues as well, since it probably
round-robins between the queues or something like that. Basically, it
could do a lot of extra work before noticing that error in there.

Now we could avoid that by having the master read messages from the
queue immediately and just save them off to local storage if they aren't
error messages. But that's not very desirable either, because now we
have no flow control. The workers will just keep spamming tuples that
the master isn't ready for into the queues, and the master will keep
reading them and saving them to local storage, and eventually it will
run out of memory and die. We could engineer some solution to this
problem, of course, but it seems quite a bit simpler to just have two
queues. The error queues don't need to be very big (I made them 16kB,
which is trivial on any system on which you care about having working
parallelism), and the tuple queues can be sized as needed to avoid
pipeline stalls. (The first sketch at the end of this message shows such
a two-queue setup, and the second shows the kind of non-blocking read
loop it enables.)

> 2.
> Currently there is no interface for wait_for_workers_to_become_ready()
> in your patch. Don't you think it is important that, before we start
> fetching tuples, we make sure all workers have started? What if some
> worker fails to start?

I think that, in general, getting the most benefit out of parallelism
means *avoiding* situations where backends have to wait for each other.
If the relation being scanned is not too large, the user backend might
be able to finish the whole scan - or a significant fraction of it -
before the workers initialize. Of course, in that case it might have
been a bad idea to parallelize in the first place, but we should still
try to make the best of the situation. If some worker fails to start,
then instead of the full degree-N parallelism we were hoping for, we
have some degree K < N, so things will take a little longer, but
everything should still work. (The third sketch below shows launching
without waiting.)

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
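To make the two-queue design concrete, here is a minimal sketch of
per-worker queue setup using the shm_mq/shm_toc APIs the questions above
refer to. The toc keys, the tuple-queue size, and both function names
are hypothetical, and pq_redirect_to_shm_mq is shown with the
two-argument signature it had in the 9.5-era patches; treat this as an
illustration of the idea, not the patch's actual code:

    #include "postgres.h"
    #include "libpq/pqmq.h"         /* pq_redirect_to_shm_mq */
    #include "storage/dsm.h"
    #include "storage/proc.h"       /* MyProc */
    #include "storage/shm_mq.h"
    #include "storage/shm_toc.h"

    #define KEY_ERROR_QUEUE   1         /* hypothetical toc keys */
    #define KEY_TUPLE_QUEUE   2
    #define ERROR_QUEUE_SIZE  16384     /* 16kB: errors are small and rare */
    #define TUPLE_QUEUE_SIZE  65536     /* sized to avoid pipeline stalls */

    /* Master side: carve one small error queue and one larger tuple
     * queue per worker out of the dynamic shared memory segment. */
    static void
    setup_worker_queues(shm_toc *toc)
    {
        shm_mq *err_mq = shm_mq_create(shm_toc_allocate(toc, ERROR_QUEUE_SIZE),
                                       ERROR_QUEUE_SIZE);
        shm_mq *tup_mq = shm_mq_create(shm_toc_allocate(toc, TUPLE_QUEUE_SIZE),
                                       TUPLE_QUEUE_SIZE);

        shm_mq_set_receiver(err_mq, MyProc);    /* the master reads both */
        shm_mq_set_receiver(tup_mq, MyProc);
        shm_toc_insert(toc, KEY_ERROR_QUEUE, err_mq);
        shm_toc_insert(toc, KEY_TUPLE_QUEUE, tup_mq);
    }

    /* Worker side: attach to both queues, then point elog/ereport
     * output at the error queue only; tuples go on the other queue. */
    static shm_mq_handle *
    attach_worker_queues(dsm_segment *seg, shm_toc *toc)
    {
        shm_mq *err_mq = shm_toc_lookup(toc, KEY_ERROR_QUEUE);
        shm_mq *tup_mq = shm_toc_lookup(toc, KEY_TUPLE_QUEUE);
        shm_mq_handle *err_mqh;
        shm_mq_handle *tup_mqh;

        shm_mq_set_sender(err_mq, MyProc);
        shm_mq_set_sender(tup_mq, MyProc);
        err_mqh = shm_mq_attach(err_mq, seg, NULL);
        tup_mqh = shm_mq_attach(tup_mq, seg, NULL);

        /* Errors and notices now travel on their own small queue. */
        pq_redirect_to_shm_mq(err_mq, err_mqh);

        /* The caller sends tuples with shm_mq_send() on this handle. */
        return tup_mqh;
    }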
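And here is a sketch of the master-side read loop such a layout enables:
drain the tuple queues round-robin without ever blocking on a single
queue, so that a full tuple queue throttles its worker (shm_mq_send
blocks in the worker) while an error can still surface promptly via the
separate error queues. The WorkerStream struct and deserialize_tuple are
hypothetical; shm_mq_receive, WaitLatch, and ResetLatch are the stock
APIs of this era:

    #include "postgres.h"
    #include "access/htup.h"        /* HeapTuple */
    #include "miscadmin.h"          /* CHECK_FOR_INTERRUPTS */
    #include "storage/latch.h"
    #include "storage/proc.h"
    #include "storage/shm_mq.h"

    typedef struct WorkerStream
    {
        shm_mq_handle *tup_mqh;     /* one tuple queue per worker */
    } WorkerStream;

    /* Hypothetical: rebuild a tuple from a queue message. */
    static HeapTuple deserialize_tuple(void *data, Size nbytes);

    static HeapTuple
    next_tuple_from_workers(WorkerStream *streams, int nworkers)
    {
        for (;;)
        {
            bool    all_detached = true;
            int     i;

            /* However error-queue traffic ends up being noticed (signal,
             * latch, ...), surfacing it via CHECK_FOR_INTERRUPTS() means
             * we never grind through queued tuples before throwing it. */
            CHECK_FOR_INTERRUPTS();

            for (i = 0; i < nworkers; i++)
            {
                void         *data;
                Size          nbytes;
                shm_mq_result res;

                /* nowait = true: never block on any one queue */
                res = shm_mq_receive(streams[i].tup_mqh, &nbytes,
                                     &data, true);
                if (res == SHM_MQ_SUCCESS)
                    return deserialize_tuple(data, nbytes);
                if (res != SHM_MQ_DETACHED)
                    all_detached = false;   /* WOULD_BLOCK: still alive */
            }

            if (all_detached)
                return NULL;        /* every worker has finished */

            /* Senders set our latch when they enqueue, so sleep here. */
            WaitLatch(&MyProc->procLatch, WL_LATCH_SET, 0);
            ResetLatch(&MyProc->procLatch);
        }
    }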
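Finally, a sketch of what launching without a
wait_for_workers_to_become_ready() step could look like: ask for N
workers, but treat a failed registration as reduced parallelism rather
than an error. RegisterDynamicBackgroundWorker is the real bgworker API;
ParallelScanState and MAX_SCAN_WORKERS are hypothetical:

    #include "postgres.h"
    #include "postmaster/bgworker.h"

    #define MAX_SCAN_WORKERS 64     /* hypothetical cap */

    typedef struct ParallelScanState
    {
        int                      nworkers_launched;
        BackgroundWorkerHandle  *handle[MAX_SCAN_WORKERS];
    } ParallelScanState;

    static void
    launch_scan_workers(ParallelScanState *pss, BackgroundWorker *worker,
                        int nworkers_requested)
    {
        int i;

        pss->nworkers_launched = 0;
        for (i = 0; i < nworkers_requested && i < MAX_SCAN_WORKERS; i++)
        {
            /* Registration fails if no bgworker slots are free; that is
             * not an error here -- we simply run at degree K < N. */
            if (RegisterDynamicBackgroundWorker(worker,
                                &pss->handle[pss->nworkers_launched]))
                pss->nworkers_launched++;
        }

        /* Deliberately no waiting: the master starts scanning at once,
         * and whichever workers come up join in. */
    }

If the master additionally attaches each queue with the worker's
BackgroundWorkerHandle (the third argument of shm_mq_attach), a worker
that registers but dies before attaching shows up as SHM_MQ_DETACHED,
which the read loop above already treats like normal completion.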