On Thu, Nov 26, 2015 at 3:45 AM, Simon Riggs <si...@2ndquadrant.com> wrote:
> Sounds like good progress.
Thanks.

> This gives us multiple copies of the hash table, which means we must
> either use N * work_mem, or we must limit the hash table to work_mem / N
> per partial plan.

We use N * work_mem in this patch. The other option would be a lot more expensive in terms of planning time, because we'd have to generate one set of hash join paths (or at least hash join costs) for use in serial plans and another set for use in parallel plans. As it is, the parallel stuff just piggybacks on the plans we're generating anyway. We might have to change that at some point, but I think we'll do well to put that point off as long as possible.

> How can the partial paths share a hash table?

I'm glad you asked that question. For that, we need an allocator that can work with dynamic shared memory segments, and a hash table built on top of that allocator. It's relatively complicated because the same DSM segments can be mapped at different addresses in different processes, so you can't use native pointers. However, I'm pleased to report that my colleague Thomas Munro is working on this problem, and that he will submit this work to pgsql-hackers when it's ready, which may be soon enough that we can consider including it in 9.6 if the design meets with general approval.

As you may recall, I previously proposed an allocator framework, still somewhat of a work in progress at that time, and the reaction here was a bit lukewarm, which is why I shifted from parallel sort to parallel seq scan as a first project. I now think that was a good decision, and as a result of Peter Geoghegan's work on sorting and my own experience further developing the parallelism code, I no longer think that allocator is the right solution for parallel sort anyway. But I still think it might be the right solution for a parallel hash join with a shared hash table.

My idea is that you'd end up with a plan like this:

Gather
-> Hash Join
   -> Parallel Seq Scan
   -> Parallel Hash
      -> Parallel Seq Scan

Not only does this build only one copy of the hash table instead of N copies, but we can parallelize the hash table construction itself by having all workers insert in parallel, which is pretty cool.

However, I don't expect this to be better than an unshared hash table in all cases. We have a fair amount of evidence that accessing backend-private data structures can sometimes be much faster than accessing shared data structures - cf. not only the syscaches and relcache, but also the planner's use of Materialize nodes in certain cases even when there are no filtering quals underneath. So there's probably going to be a need to consider both types of plans and decide between them based on memory usage and expected runtime. The details are not all clear to me yet, and most likely we'll have to postpone some of those decisions until the code is written and can be performance-tested to see in which cases each strategy performs better or worse than the other. What I can confirm at this point is that I've thought about the problem you're asking about here, and that EnterpriseDB intends to contribute code to address it.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
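
To make the "no native pointers" point above concrete, here is a minimal sketch of the usual workaround, base-relative pointers: store byte offsets from the segment's base address and convert to native pointers only after each process has mapped the segment. The names (relptr, relptr_decode, relptr_encode) are invented here for illustration and are not taken from the actual allocator work:

    /* A relative pointer: a byte offset from the DSM segment's base
       address in whichever process has it mapped.  0 acts as NULL. */
    typedef size_t relptr;

    /* Convert a relative pointer to a native pointer, given this
       process's mapping of the segment. */
    static inline void *
    relptr_decode(char *base, relptr rp)
    {
        return rp == 0 ? NULL : (void *) (base + rp);
    }

    /* Convert a native pointer back to a relative pointer. */
    static inline relptr
    relptr_encode(char *base, void *p)
    {
        return p == NULL ? 0 : (relptr) ((char *) p - base);
    }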
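
Likewise, a hypothetical sketch of what "all workers insert in parallel" could look like, reusing the relptr helpers above and striping LWLocks across the hash buckets so that concurrent inserts into different partitions don't contend. Again, all names here are invented; this shows one possible scheme, not the design of the forthcoming patch:

    #include "postgres.h"
    #include "storage/lwlock.h"

    #define NUM_PARTITIONS 128      /* arbitrary lock-striping factor */

    typedef struct HashNode
    {
        relptr      next;           /* relative pointer to next node in chain */
        uint32      hashvalue;
        /* tuple data would follow */
    } HashNode;

    typedef struct SharedHash
    {
        size_t      nbuckets;
        relptr      buckets;        /* relative pointer to array of relptrs */
        /* each lock must be set up with LWLockInitialize() at create time */
        LWLock      partition_locks[NUM_PARTITIONS];
    } SharedHash;

    static void
    shared_hash_insert(char *base, SharedHash *ht, HashNode *node)
    {
        size_t      bucketno = node->hashvalue % ht->nbuckets;
        relptr     *buckets = relptr_decode(base, ht->buckets);
        LWLock     *lock = &ht->partition_locks[bucketno % NUM_PARTITIONS];

        /* serialize only against inserters hitting the same partition */
        LWLockAcquire(lock, LW_EXCLUSIVE);
        node->next = buckets[bucketno];
        buckets[bucketno] = relptr_encode(base, node);
        LWLockRelease(lock);
    }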