On Sat, Dec 28, 2024 at 10:26 AM Jeremy Schneider <schnei...@ardentperf.com> wrote:
Thanks for the feedback, Jeremy!

> While I don't have a detailed design in mind, I'd like to add a strong
> +1 on the general idea that work_mem is hard to effectively use because
> queries can vary so widely in how many nodes might need work memory.

That's my main concern with work_mem. Even if I am running a single
query at a time, on a single connection, I still run into the
limitations of the work_mem GUC if I try to run: (a) a query with a
single large Hash Join; vs. (b) a query with, say, 10 Hash Joins, some
of which are small and others of which are large.

In (a), I want to set work_mem to the total amount of memory available
on the system. In (b), I *don't* want to set it to "(a) / 10", because
I want some Hash Joins to get more memory than others. It's just really
hard to distribute working memory intelligently using this GUC!

> I'd almost like to have two limits:
>
> First, a hard per-connection limit which could be set very high ...
> If palloc detects that an allocation would take the total over the
> hard limit, then you just fail the palloc with an OOM. This protects
> postgres from a memory leak in a single backend OOM'ing the whole
> system and restarting the whole database; failing a single connection
> is better than failing all of them.

Yes, this seems fine to me. Per Tomas's comments in [1], the "OOM"
limit would probably have to be global.

> Second, a soft per-connection "total_connection_work_mem_target" which
> could be set lower. The planner can just look at the total number of
> nodes that it expects to allocate work memory, divide the target by
> this and then set the work_mem for that query.

I wouldn't even bother to inform the planner of the "actual" working
memory, because that adds complexity to the query compilation process.
Just: the planner assumes it will have "work_mem" per node, but records
how much memory it estimates it will use ("nbytes") [2]. The executor
then sets the "actual" work_mem, for each node, based on nbytes and the
global "work_mem" GUC.

(It's hard for the planner to decide how much memory each path / node
should get while building and costing plans. E.g., is it better to take
X MB of working memory from join 1 and hand it to join 2, which would
cause join 1 to spill Y% more data, but join 2 to spill Z% less?)

> Maybe even could do a "total_instance_work_mem_target" where it's
> divided by the number of average active connections or something.

This seems fine to me, but I prefer "backend_work_mem_target", because
I find multiplying easier than dividing. (E.g., if I add 2 more
backends, I would use up to "backend_work_mem_target" * 2 more memory.
Vs., each query now gets "total_instance_work_mem_target / (N+2)"
memory...)

[1] https://www.postgresql.org/message-id/4806d917-c019-49c7-9182-1203129cd295%40vondra.me
[2] https://www.postgresql.org/message-id/CAJVSvF6i_1Em6VPZ9po5wyTubGwifvfNFLrOYrdgT-e1GmR5Fw%40mail.gmail.com

Thanks,
James
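
To make the executor-side idea above concrete, here is a minimal
standalone C sketch (not PostgreSQL source code) of one way a
backend-wide budget could be split across a query's nodes in proportion
to the planner's per-node "nbytes" estimates, instead of every node
getting the same flat work_mem. The budget value, the proportional
split, and the 64kB per-node floor are assumptions made purely for
illustration; the message above deliberately leaves the actual policy
open.

/*
 * Sketch only: split a backend-wide working-memory budget across the
 * nodes of a query, proportionally to each node's estimated nbytes.
 */
#include <stdio.h>
#include <stdint.h>

#define MIN_NODE_WORK_MEM	(UINT64_C(64) * 1024)	/* per-node floor (assumed) */

static void
distribute_work_mem(const uint64_t *nbytes, uint64_t *actual, int nnodes,
					uint64_t budget)
{
	uint64_t	total = 0;

	for (int i = 0; i < nnodes; i++)
		total += nbytes[i];

	for (int i = 0; i < nnodes; i++)
	{
		if (total <= budget)
			actual[i] = nbytes[i];	/* everything fits: give each node its estimate */
		else
			actual[i] = (uint64_t) ((double) nbytes[i] / (double) total *
									(double) budget);

		if (actual[i] < MIN_NODE_WORK_MEM)
			actual[i] = MIN_NODE_WORK_MEM;
	}
}

int
main(void)
{
	/* scenario (b): 10 Hash Joins, two large and eight small (estimates in bytes) */
	uint64_t	nbytes[10] = {
		UINT64_C(2) << 30, UINT64_C(1) << 30,	/* 2GB, 1GB */
		UINT64_C(16) << 20, UINT64_C(16) << 20,
		UINT64_C(16) << 20, UINT64_C(16) << 20,
		UINT64_C(16) << 20, UINT64_C(16) << 20,
		UINT64_C(16) << 20, UINT64_C(16) << 20
	};
	uint64_t	actual[10];
	uint64_t	budget = UINT64_C(2) << 30;		/* hypothetical 2GB backend-wide budget */

	distribute_work_mem(nbytes, actual, 10, budget);

	for (int i = 0; i < 10; i++)
		printf("node %2d: estimated %4lu MB -> gets %4lu MB\n",
			   i,
			   (unsigned long) (nbytes[i] >> 20),
			   (unsigned long) (actual[i] >> 20));
	return 0;
}

With these numbers (two large joins plus eight small ones, 2GB budget),
the two big joins end up with roughly 1.3GB and 650MB and each small
join with about 10MB, rather than a flat 2GB / 10 = ~205MB per node.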