Optimize planner memory consumption for huge arrays
Hi, hackers,

Looking at the planner behaviour with the memory consumption patch [1], I figured out that arrays increase the optimizer's memory consumption significantly. See init.sql in the attachment. The point here is that the planner does a small memory allocation for each array element during estimation. As a result, the planner consumes about 250 bytes for each integer element.

This may not be a problem most of the time. However, in the case of partitions, the memory consumption is multiplied by the number of partitions. Such a corner case looks weird, but the fix is simple. So, why not?

The diff in the attachment is a proof of concept showing how to reduce this waste of memory. Having benchmarked it a bit, I didn't find any overhead.

[1] Report planning memory in EXPLAIN ANALYZE
https://www.postgresql.org/message-id/flat/CAExHW5sZA=5lj_zppro-w09ck8z9p7eayaqq3ks9gdfhrxe...@mail.gmail.com

--
Regards,
Andrey Lepikhov

init.sql — Description: application/sql
array_compact_memusage.diff — Description: Binary data
Re: Optimize planner memory consumption for huge arrays
On Mon, Sep 4, 2023, at 3:37 PM, Dilip Kumar wrote:
> On Mon, Sep 4, 2023 at 11:58 AM Lepikhov Andrei wrote:
>> Hi, hackers,
>>
>> Looking at the planner behaviour with the memory consumption patch [1], I figured out that arrays increase memory consumption by the optimizer significantly. See init.sql in attachment.
>> The point here is that the planner does small memory allocations for each element during estimation. As a result, it looks like the planner consumes about 250 bytes for each integer element.
>>
>> It is maybe not a problem most of the time. However, in the case of partitions, memory consumption multiplies by each partition. Such a corner case looks weird, but the fix is simple. So, why not?
>>
>> The diff in the attachment is proof of concept showing how to reduce wasting of memory. Having benchmarked a bit, I didn't find any overhead.
>
> +			Const *c = makeConst(nominal_element_type,
> +								 -1,
> +								 nominal_element_collation,
> +								 elmlen,
> +								 elem_values[i],
> +								 elem_nulls[i],
> +								 elmbyval);
> +
> +			args = list_make2(leftop, c);
> 			if (is_join_clause)
> 				s2 = DatumGetFloat8(FunctionCall5Coll(&oprselproc,
> 													  clause->inputcollid,
> @@ -1984,7 +1985,8 @@ scalararraysel(PlannerInfo *root,
> 											ObjectIdGetDatum(operator),
> 											PointerGetDatum(args),
> 											Int32GetDatum(varRelid)));
> -
> +			list_free(args);
> +			pfree(c);
>
> Maybe you can just use list_free_deep, instead of storing the constant in a separate variable.

As I see it, the first element in the list is leftop, which is reused on subsequent iterations. So we can't use list_free_deep here.

--
Regards,
Andrei Lepikhov
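The allocation pattern being discussed can be modeled outside PostgreSQL. The sketch below is a toy illustration only — FakeConst, estimate_one, and sum_selectivity are invented names standing in for the real Const node and the scalararraysel() estimation loop — showing the idea of the fix: one scratch node allocated before the loop and overwritten per element, instead of a fresh per-element allocation left behind in the planner's memory context.

```c
#include <stddef.h>
#include <stdlib.h>

/* Stand-in for PostgreSQL's Const node; illustrative only. */
typedef struct FakeConst
{
    long value;
} FakeConst;

/* Toy per-element selectivity; not the real estimator. */
static double
estimate_one(const FakeConst *c)
{
    return (c->value % 2) ? 0.5 : 0.25;
}

/* Model of the patched loop: one scratch node is allocated before the
 * loop, overwritten in place for every array element, and freed once
 * afterwards -- instead of allocating a fresh node (and argument list)
 * per element. */
double
sum_selectivity(const long *elems, size_t nelems)
{
    FakeConst *c = malloc(sizeof(FakeConst));   /* allocated once */
    double s = 0.0;

    for (size_t i = 0; i < nelems; i++)
    {
        c->value = elems[i];                    /* reuse, don't reallocate */
        s += estimate_one(c);
    }
    free(c);                                    /* freed once */
    return s;
}
```

With 1E4 elements, the difference between one allocation and 1E4 allocations of ~260 bytes each is exactly the memory growth reported in the thread.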
Re: Report planning memory in EXPLAIN ANALYZE
Using your patch I found one case of redundant memory usage in the planner [1]. It can be interesting as an example of how this patch can detect problems.

[1] Optimize planner memory consumption for huge arrays
https://www.postgresql.org/message-id/flat/em9939439a-441a-4b27-a977-ebdf5987dc49%407d14f008.com

--
Regards,
Andrei Lepikhov

On Thu, Aug 24, 2023, at 5:31 PM, Ashutosh Bapat wrote:
> Sorry for the late reply. I was working on David's suggestion.
>
> Here's a response to your questions and also a new set of patches.
>
> On Tue, Aug 22, 2023 at 1:16 PM jian he wrote:
>> Hi. I tested it. Not sure if the following is desired behavior: first run with explain, then run with explain(summary on). The second time, Planning Memory: 0 bytes.
>>
>> regression=# PREPARE q4 AS SELECT 1 AS a;
>> regression=# EXPLAIN EXECUTE q4;
>>                 QUERY PLAN
>> ------------------------------------------
>>  Result  (cost=0.00..0.01 rows=1 width=4)
>> (1 row)
>>
>> regression=# EXPLAIN (summary on) EXECUTE q4;
>>                 QUERY PLAN
>> ------------------------------------------
>>  Result  (cost=0.00..0.01 rows=1 width=4)
>>  Planning Time: 0.009 ms
>>  Planning Memory: 0 bytes
>> (3 rows)
>
> Yes. This is expected since the plan is already available and no memory is required to fetch it from the cache. I imagine, if there were parameters to the prepared plan, it would consume some memory to evaluate those parameters, and some more memory if replanning was required.
>
>> Previously, if you wanted stats for a given memory context and its children, you could only use MemoryContextStatsDetail, but it only goes to stderr or LOG_SERVER_ONLY. Now, MemoryContextMemUsed is being exposed. I can do something like:
>>
>> mem_consumed = MemoryContextMemUsed(CurrentMemoryContext);
>> // do stuff
>> mem_consumed = MemoryContextMemUsed(CurrentMemoryContext) - mem_consumed;
>>
>> It will give me the NET memory consumed by doing stuff in between. Is my understanding correct?
>
> Yes.
>
> Here are three patches based on the latest master.
>
> 0001
>
> This is the same as the previous patch with a few things fixed. 1. Call MemoryContextMemUsed() before INSTR_TIME_SET_CURRENT so that the time taken by MemoryContextMemUsed() is not counted in planning time. 2. In ExplainOnePlan, use a separate code block for reporting memory.
>
> 0002
>
> This patch reports both memory allocated and memory used in the CurrentMemoryContext at the time of planning. It converts "Planning Memory" into a section with two values reported as "used" and "allocated", as below:
>
> # explain (summary on) select * from pg_class c, pg_type t where c.reltype = t.oid;
>                        QUERY PLAN
> --------------------------------------------------------
>  Hash Join  (cost=28.84..47.08 rows=414 width=533)
>  ... snip ...
>  Planning Time: 9.274 ms
>  Planning Memory: used=80848 bytes allocated=90112 bytes
> (7 rows)
>
> In JSON format:
>
> # explain (summary on, format json) select * from pg_class c, pg_type t where c.reltype = t.oid;
>           QUERY PLAN
> -------------------------------
>  [                            +
>    {                          +
>      "Plan": {                +
>        ... snip ...           +
>      },                       +
>      "Planning Time": 0.466,  +
>      "Planning Memory": {     +
>        "Used": 80608,         +
>        "Allocated": 90112     +
>      }                        +
>    }                          +
>  ]
> (1 row)
>
> PFA explain and explain analyze output in all the formats.
>
> The patch adds MemoryContextMemConsumed(), which is similar to MemoryContextStats() or MemoryContextStatsDetail() except that 1. the latter always print the memory statistics to either stderr or the server error log, and 2. they don't return the MemoryContextCounters they gathered. We should probably change MemoryContextStats() or MemoryContextStatsDetail() according to those two points and not add MemoryContextMemConsumed().
>
> I have not merged this into 0001 yet. But once we agree upon whether this is the right thing to do, I will merge it into 0001.
>
> 0003
>
> When reporting memory allocated, confusion may arise as to whether to report the "net" memory allocated between the start and end of the planner, OR only the memory that remains allocated after the end. This confusion can be avoided by using an exclusive memory context (a child of CurrentMemoryContext) for planning. That's what 0003 implements, as suggested by David. As mentioned in one of the comments in the patch, in order to measure memory allocated accurately, the new MemoryContext has to be of the same type as the memory context that would otherwise be used by the planner.
Re: Optimize planner memory consumption for huge arrays
On Wed, Sep 6, 2023, at 8:09 PM, Ashutosh Bapat wrote:
> Hi Lepikhov,
>
> Thanks for using my patch and I am glad that you found it useful.
>
> On Mon, Sep 4, 2023 at 10:56 AM Lepikhov Andrei wrote:
>> Looking at the planner behaviour with the memory consumption patch [1], I figured out that arrays increase memory consumption by the optimizer significantly. See init.sql in attachment.
>> The point here is that the planner does small memory allocations for each element during estimation. As a result, it looks like the planner consumes about 250 bytes for each integer element.
>
> I guess the numbers you mentioned in init.sql are total memory used by the planner (as reported by the patch in the thread) when planning that query, and not memory consumed by the Const nodes themselves. Am I right? I think the measurements need to be explained better, and also the realistic scenario you are trying to optimize.

Yes, it is the total memory consumed by the planner - I used the numbers generated by your patch [1]. I kept increasing the number of elements in the array to factor out the memory consumed by the planner for other purposes. As you can see, the array with one element consumes 12kB of memory; with 1E4 elements, 2.6MB. All of that increment comes solely from the enlargement of this array: (2600kB - 12kB) / 1E4 elements ≈ 260 bytes. So I conclude that each 4-byte element produces a consumption of about 260 bytes of memory. This scenario came from a user complaint - they had strict restrictions on memory usage and got stuck in this unusual memory usage case.

> I guess the reason you think that partitioning will increase the memory consumed is because each partition will have the clause translated for it. Selectivity estimation for each partition will create that many Const nodes and hence consume memory. Am I right?

Yes.

> Can you please measure the memory consumed with and without your patch.

Done.
See the test case and results in 'init_parts.sql' in the attachment. Short summary below. I varied the number of elements from 1 to 1E4 and the number of partitions from 1 to 100. As you can see, partitioning adds a lot of memory consumption by itself. But we see an effect from the patch as well.

master:

parts\elems      1     1E1     1E2     1E3     1E4
1             28kB    50kB   0.3MB   2.5MB    25MB
10            45kB   143kB   0.6MB   4.8MB    47MB
100          208kB   125kB   3.3MB    27MB   274MB

patched:

parts\elems      1     1E1     1E2     1E3     1E4
1             28kB    48kB  0.25MB   2.2MB  22.8MB
10            44kB   100kB   313kB   2.4MB  23.7MB
100          208kB   101kB   0.9MB   3.7MB  32.4MB

Just for comparison, without partitioning:

elems            1     1E1     1E2     1E3     1E4
master:       12kB    14kB    37kB   266kB   2.5MB
patched:      12kB  11.5kB    13kB    24kB   141kB

>> It is maybe not a problem most of the time. However, in the case of partitions, memory consumption multiplies by each partition. Such a corner case looks weird, but the fix is simple. So, why not?
>
> With vectorized operations becoming a norm these days, it's possible to have thousands of elements in the array of an ANY or IN clause. It will also be common to have thousands of partitions. But I think what we need to do here is to write a selectivity estimation function which takes a const array and returns selectivity without requiring the creation of a Const node for each element.

Maybe you're right. Could you show any examples of vectorized usage of Postgres to help me understand your idea more clearly? Here I propose only a quick, simple solution. I don't think it would change the direction of development.

>> The diff in the attachment is proof of concept showing how to reduce wasting of memory. Having benchmarked a bit, I didn't find any overhead.
>
> You might want to include your benchmarking results as well.

Nothing interesting there: pgbench TPS and planning time for the cases above don't change.

[1] Report planning memory in EXPLAIN ANALYZE

--
Regards,
Andrei Lepikhov

init_parts.sql — Description: application/sql
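The per-element overhead quoted in this thread is just a difference quotient over two of the measurements. A tiny helper (hypothetical, not part of either patch) makes the arithmetic explicit:

```c
/* Estimate the planner's memory overhead per array element from two total
 * measurements: subtract the baseline (1-element) consumption and divide
 * by the number of added elements.  Inputs are in kB; the result is in
 * bytes per element.  Illustrative helper, not PostgreSQL code. */
double
bytes_per_element(double total_kb, double baseline_kb, double nelems)
{
    return (total_kb - baseline_kb) * 1000.0 / nelems;
}
```

For the non-partitioned master numbers above, bytes_per_element(2600, 12, 1e4) gives roughly the 260 bytes per 4-byte integer discussed in the thread.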
Re: MergeJoin beats HashJoin in the case of multiple hash clauses
On Mon, Sep 11, 2023, at 11:51 AM, Andy Fan wrote:
> Hi,
>
> On Thu, Jun 15, 2023 at 4:30 PM Andrey Lepikhov wrote:
>> Hi, all.
>>
>> Some of my clients use JOINs with three or four clauses. Quite frequently, I see complaints about an unreasonable switch of the JOIN algorithm to Merge Join instead of Hash Join. Quick research has shown one weak place - the estimation of the average bucket size in final_cost_hashjoin (see q2.sql in attachment), which uses a very conservative strategy.
>> Unlike the estimation of groups, here we use the smallest ndistinct value across all buckets instead of multiplying them (or trying to do multivariate analysis).
>> It works fine for the case of one clause. But if we have many clauses, and each has a high ndistinct value, we will overestimate the average size of a bucket and, as a result, prefer to use Merge Join. As the example in the attachment shows, this leads to a worse plan than possible, sometimes drastically worse.
>> I assume this is done out of fear of functional dependencies between the hash clause components. But in my opinion, here we should go the same way as the estimation of groups.
>
> I can reproduce the situation you want to improve and verify that the patch fixes it as expected. I think this is the right thing to do.
>
>> The attached patch shows a sketch of the solution.
>
> I understand that this is a sketch of the solution, but the changes below still confuse me:
>
> +	if (innerbucketsize > virtualbuckets)
> +		innerbucketsize = 1.0 / virtualbuckets;
>
> innerbucketsize is a fraction of rows among all the rows, so it is between 0.0 and 1.0. And virtualbuckets is the total number of buckets (considering multiple batches). How is it possible for 'innerbucketsize > virtualbuckets'? Am I missing something?

You are right, I made a mistake there. The corrected diff is in the attachment.

--
Regards,
Andrei Lepikhov

fix_bucketsize_v2.diff — Description: Binary data
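The estimation change under discussion can be modeled in isolation. The sketch below is an illustration with invented names — not the actual final_cost_hashjoin() code — of multiplying the per-clause distinct-value estimates, as group estimation does, and clamping the resulting bucket fraction so it never drops below one virtual bucket's even share:

```c
/* Estimate the fraction of inner rows that fall into one hash bucket when
 * several hash clauses are present.  Multiplying the per-clause ndistinct
 * estimates (instead of taking only the smallest) assumes the clause
 * components are independent -- the same assumption group estimation
 * makes.  Toy model, not PostgreSQL code. */
double
inner_bucket_fraction(const double *ndistinct, int nclauses,
                      double virtualbuckets)
{
    double ngroups = 1.0;

    for (int i = 0; i < nclauses; i++)
        ngroups *= ndistinct[i];

    /* fraction of inner rows expected per distinct clause combination */
    double frac = 1.0 / ngroups;

    /* cannot be finer than an even spread across all virtual buckets */
    if (frac < 1.0 / virtualbuckets)
        frac = 1.0 / virtualbuckets;

    return frac;
}
```

With one clause the result matches the conservative estimate; with two clauses of ndistinct = 100 each, the multiplied estimate is far smaller, which is exactly what makes Hash Join look cheaper in the q2.sql example.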
Re: POC: GUC option for skipping shared buffers in core dumps
On Wed, Feb 12, 2020, at 7:55 AM, tsunakawa.ta...@fujitsu.com wrote:
> From: Craig Ringer
>> Currently my options are "dump all shmem including shared_buffers" or "dump no shmem". But I usually want "dump all shmem except shared_buffers". It's tolerable to just dump s_b on a test system with a small s_b, but if enabling coredumps to track down some hard-to-repro crash on a production system I really don't want 20GB coredumps...
>
> We have a simple implementation that allows excluding shared memory. It has been working for years.
>
> [postgresql.conf]
> core_directory = 'location of core dumps'
> core_contents = '{none | minimum | full}'
> # none = doesn't dump core, minimum excludes shared memory, and full dumps all
>
> I can provide it. But it simply changes the current directory and detaches shared memory when postgres receives signals that dump core.
>
> I made this GUC because Windows also had to be dealt with.

If it's still possible, please share your patch here. I can't say much about the production case, but during development, especially in the bug-fixing process, it is really dull to wait for the core generation process every time, even if you are debugging a planner issue and are not interested in shared memory blocks at all...

--
Regards,
Andrei Lepikhov
Re: RFC: Logging plan of the running query
On Thu, Sep 7, 2023, at 1:09 PM, torikoshia wrote:
> On 2023-09-06 11:17, James Coleman wrote:
> It seems that we can know what queries were running from the stack traces you shared.
> As described above, I suspect a lock which was acquired prior to ProcessLogQueryPlanInterrupt() caused the issue.
> I will try a little more to see if I can devise a way to create the same situation.

Hi,

I have explored this patch and, by and large, agree with the code. But there are some questions I want to discuss:

1. Maybe add a hook to give extensions, like pg_query_state, a chance to do their job?

2. In this implementation, ProcessInterrupts does a lot of work during the explain creation: memory allocations, a pass over the plan, system table locks, syscache access, etc. I guess this can change the semantics of a 'safe place' where CHECK_FOR_INTERRUPTS can be called - I can imagine a SELECT query which calls a stored procedure, which executes some DDL and holds a row exclusive lock at the time of interruption. So, in my mind, it is too unpredictable to build the explain at the point of interrupt processing. Maybe it is better to think about some hook at the end of ExecProcNode, where a pending explain could be created?

Also, I suggest a minor code change with the diff in the attachment.

--
Regards,
Andrei Lepikhov

improve.diff — Description: Binary data
Re: RFC: Logging plan of the running query
On Tue, Sep 19, 2023, at 8:39 PM, torikoshia wrote:
> On 2023-09-15 15:21, Lepikhov Andrei wrote:
>> On Thu, Sep 7, 2023, at 1:09 PM, torikoshia wrote:
>> I have explored this patch and, by and large, agree with the code. But some questions I want to discuss:
>> 1. Maybe add a hook to give a chance for extensions, like pg_query_state, to do their job?
>
> Do you imagine adding a hook which enables adding custom interrupt codes, like below?
>
> https://github.com/postgrespro/pg_query_state/blob/master/patches/custom_signals_15.0.patch

No, I'm thinking about a hook which would allow us to rewrite the pg_query_state extension without additional patches, using the functionality provided by your patch. I mean, an extension could provide a console UI, not only server logging, and obtain from the target backend as much information in the explain as the instrumentation level of the current query can give. That could make pg_query_state shorter and more stable.

>> 2. In this implementation, ProcessInterrupts does a lot of work during the explain creation: memory allocations, a pass over the plan, system table locks, syscache access, etc. I guess it can change the semantics of a 'safe place' where CHECK_FOR_INTERRUPTS can be called - I can imagine a SELECT query which calls a stored procedure, which executes some DDL and holds a row exclusive lock at the time of interruption. So, in my mind, it is too unpredictable to build the explain at the point of interrupt processing. Maybe it is better to think about some hook at the end of ExecProcNode, where a pending explain could be created?
>
> Yeah, I withdrew this patch once for that reason, but we are resuming development in response to the results of a discussion by James and others at this year's pgcon about the safety of this process in CFI:
>
> https://www.postgresql.org/message-id/CAAaqYe9euUZD8bkjXTVcD9e4n5c7kzHzcvuCJXt-xds9X4c7Fw%40mail.gmail.com

I couldn't follow the reasoning behind the decision made at that conference. My concern relates to the specific case where ActiveQueryDesc points to the top-level query and the interruption arrives during DDL wrapped up in a stored procedure. For example:

CREATE TABLE test();
CREATE OR REPLACE FUNCTION ddl() RETURNS void AS $$
BEGIN
    EXECUTE format('ALTER TABLE test ADD COLUMN x integer;');
    ...
END; $$ LANGUAGE plpgsql VOLATILE;

SELECT ddl(), ... FROM ...;

> BTW it seems pg_query_state also enables users to get running query plans using CFI.
> Does pg_query_state do something for this safety concern?

No, and I'm looking for a solution which could help rewrite pg_query_state as a clean extension, without patches.

>> Also, I suggest a minor code change with the diff in the attachment.
>
> Thanks!
>
> This might be a biased opinion and objections are welcomed, but I feel the original patch is easier to read since PG_RETURN_BOOL(true/false) is located near each case.
> Also the existing function pg_log_backend_memory_contexts(), which does the same thing, has the same form as the original patch.

I got it, thank you.

--
Regards,
Andrei Lepikhov
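The deferred alternative raised in this thread — the interrupt handler only setting a flag, with the heavy EXPLAIN work done later at a known-safe point such as the end of ExecProcNode — can be sketched as a plain C model. The function names here are invented for illustration; this is not the patch's actual code:

```c
/* Interrupt-time side: record the request and nothing else.  No memory
 * allocation, no locks, no catalog access happens here, so it is safe
 * from any CHECK_FOR_INTERRUPTS() call site. */
static volatile int pending_log_query_plan = 0;
static int plans_logged = 0;

void
request_query_plan_log(void)
{
    pending_log_query_plan = 1;
}

/* Called at a safe point (modeled on the end of ExecProcNode): only
 * here is the deferred, potentially heavy work actually performed. */
void
exec_node_safe_point(void)
{
    if (pending_log_query_plan)
    {
        pending_log_query_plan = 0;
        plans_logged++;         /* stands in for building/logging EXPLAIN */
    }
}

int
plans_logged_so_far(void)
{
    return plans_logged;
}
```

The point of the split is that the interrupt can arrive while arbitrary locks are held (as in the ddl() example above), yet the expensive work only ever runs between executor node calls.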
Re: disfavoring unparameterized nested loops
On Wed, Sep 20, 2023, at 4:49 PM, David Rowley wrote:
> On Wed, 20 Sept 2023 at 19:56, Andrey Lepikhov wrote:
>> What could you say about a different way: a hybrid join? In MS SQL Server, they have such a feature [1], and, according to their description, it requires low overhead. They start from HashJoin and switch to NestLoop if the inner input contains too few tuples. It solves the issue, doesn't it?
>
> A complexity which you may not be considering here is that Nested Loop joins always preserve the tuple order from the outer side of the join, whereas hash joins will not do this when multi-batching.

My idea here is the same as what the MS SQL folks did: prefetch some predefined number of tuples from the HashJoin inner and, if the planner has made a mistake and overestimated it, move the hash join inner to NestLoop as an outer. The opposite strategy for underestimation - starting with NestLoop and switching to HashJoin - looks more difficult, but the main question is: is it worthwhile to research?

> I've no idea how the SQL Server engineers solved that.
>
>> [1] https://techcommunity.microsoft.com/t5/sql-server-blog/introducing-batch-mode-adaptive-joins/ba-p/385411

--
Regards,
Andrei Lepikhov
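The prefetch-and-decide idea can be sketched abstractly: consume inner tuples up to a planner-chosen threshold and only commit to the full hash build if the threshold is exceeded. The toy model below uses invented names and is neither SQL Server's nor PostgreSQL's actual implementation:

```c
#include <stddef.h>

typedef enum
{
    JOIN_NESTLOOP,
    JOIN_HASHJOIN
} JoinStrategy;

/* Adaptive choice modeled on the batch-mode adaptive join: prefetch
 * inner rows up to 'threshold'; if the inner side is exhausted before
 * reaching it, the optimizer's overestimate is detected and a nested
 * loop over the small prefetched set is preferred to completing a full
 * hash table build. */
JoinStrategy
choose_strategy(size_t inner_rows_prefetched, size_t threshold)
{
    return (inner_rows_prefetched < threshold) ? JOIN_NESTLOOP
                                               : JOIN_HASHJOIN;
}
```

The hard part, as David notes, is not this decision itself but preserving the ordering and multi-batching semantics when the strategy flips at run time.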
Re: Comment about set_join_pathlist_hook()
On Wed, Sep 20, 2023, at 5:05 PM, Etsuro Fujita wrote:
> Hi,
>
> What I am concerned about from the report [1] is that this comment is a bit too terse; it might cause a misunderstanding that extensions can do different things than we intend to allow:
>
> 	/*
> 	 * 6. Finally, give extensions a chance to manipulate the path list.
> 	 */
> 	if (set_join_pathlist_hook)
> 		set_join_pathlist_hook(root, joinrel, outerrel, innerrel,
> 							   jointype, &extra);
>
> So I would like to propose extending the comment to explain what they can do, as in the comment about set_rel_pathlist_hook() in allpaths.c. Attached is a patch for that.

It makes sense. But why do you restrict additions to the pathlist to only the add_path() routine? It can fail to add a path to the pathlist. We need to find out the result of the add_path operation and check the pathlist each time. So, sometimes, it can be better to add a path manually.

One more slip-up the comment could prevent: when removing a path from the pathlist, we should remember the cheapest_* pointers. Also, it may be good to remind the user that jointype and extra->sjinfo->jointype aren't always the same.

> [1] https://www.postgresql.org/message-id/CACawEhV%3D%2BQ0HXrcDergbTR9EkVFukgRPMTZbRFL-YK5CRmvYag%40mail.gmail.com

--
Regards,
Andrei Lepikhov
Re: Comment about set_join_pathlist_hook()
On Thu, Sep 21, 2023, at 12:53 PM, Etsuro Fujita wrote:
> Hi,
>
> On Thu, Sep 21, 2023 at 11:49 AM Lepikhov Andrei wrote:
>> On Wed, Sep 20, 2023, at 5:05 PM, Etsuro Fujita wrote:
>>> What I am concerned about from the report [1] is that this comment is a bit too terse; it might cause a misunderstanding that extensions can do different things than we intend to allow:
>>>
>>> 	/*
>>> 	 * 6. Finally, give extensions a chance to manipulate the path list.
>>> 	 */
>>> 	if (set_join_pathlist_hook)
>>> 		set_join_pathlist_hook(root, joinrel, outerrel, innerrel,
>>> 							   jointype, &extra);
>>>
>>> So I would like to propose extending the comment to explain what they can do, as in the comment about set_rel_pathlist_hook() in allpaths.c. Attached is a patch for that.
>>
>> It makes sense. But why do you restrict additions to the pathlist to only the add_path() routine? It can fail to add a path to the pathlist. We need to find out the result of the add_path operation and check the pathlist each time. So, sometimes, it can be better to add a path manually.
>
> I do not agree with you on this point; I think you can do so at your own responsibility, but I think it is better for extensions to use add_path(), because that makes them stable. (Assuming that add_path() has a bug and we change its logic to fix the bug, extensions that do not follow the standard procedure might not work anymore.)

Ok, I got it. This question relates to the add_path() interface itself, not to the comment. It is awkward to check the pathlist every time for the result of add_path.

>> One more slip-up could be prevented by the comment: when removing a path from the pathlist, we should remember the cheapest_* pointers.
>
> Do we really need to do this? I mean, we do set_cheapest() afterward. See standard_join_search().

Agreed; in the case of the current join it doesn't make sense. I got stuck in this situation because, providing an additional path at the current level, I needed to arrange paths for the inner and outer relations too. Thanks for the explanation!

--
Regards,
Andrei Lepikhov
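The point that add_path() may silently discard the submitted path — which is why a hook author has to check the pathlist rather than assume success — can be modeled with a toy cost-based accept/reject function. This is illustrative C with invented names, not the real pathlist machinery:

```c
#include <stdbool.h>

/* Toy single-slot "pathlist" that keeps only the cheapest path so far,
 * modeling the way add_path() may discard a dominated path.  The return
 * value tells the caller whether the path survived -- information the
 * real add_path() does not hand back, forcing callers to inspect the
 * pathlist afterwards. */
static double cheapest_total_cost = -1.0;   /* -1 means list is empty */

bool
toy_add_path(double total_cost)
{
    if (cheapest_total_cost < 0.0 || total_cost < cheapest_total_cost)
    {
        cheapest_total_cost = total_cost;
        return true;            /* path accepted */
    }
    return false;               /* path dominated and rejected */
}
```

A hook that submits a path costing more than an existing one sees it vanish; the real comparison also weighs startup cost, pathkeys, and parameterization, so rejection is even harder to predict there.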
Re: [PoC] Reducing planning time when tables have many partitions
On Wed, Sep 20, 2023, at 5:04 PM, Yuya Watari wrote:
> On Tue, Sep 19, 2023 at 5:21 PM Andrey Lepikhov wrote:
>> Working on self-join removal in the thread [1] nearby, I ran into a problem which gave me an additional argument for working in this new direction, beyond the couple of previous ones.
>> With indexing positions in the list of equivalence members, we make some optimizations, like join elimination, more complicated - they may need to remove some clauses and equivalence-class members.
>> To change the lists of derived clauses or ec_members, we would have to go through all the index lists and fix them, which is a non-trivial operation.
>
> Thank you for looking into this and pointing that out. I understand that this problem will occur somewhere like your patch [1] quoted below, because we need to modify RelOptInfo->eclass_child_members in addition to ec_members. Is my understanding correct? (Of course, I know ec_[no]rel_members, but I doubt we need them.)

It is okay if we talk specifically about the self-join-removal feature, because joins are removed before inheritance expansion. But ec_source_indexes and ec_derive_indexes point to specific places in the eq_sources and eq_derives lists. If I removed an EquivalenceClass or a restriction during an optimisation, I would have to adjust all the indexes, too. Right now, I use a workaround here and remove the index link without removing the element from the list. But I'm not sure how good this approach is in the long run. So, having eq_sources and eq_derives localised in the EC could make such optimisations a bit simpler.

--
Regards,
Andrei Lepikhov
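Why removing one eq_sources entry forces a pass over every index list can be shown with a minimal model — plain C arrays standing in for the List machinery, with invented names:

```c
#include <stddef.h>

/* After deleting the element at position 'removed' from the indexed
 * list, repair one array of stored indexes into that list: references
 * to the removed element become dangling (marked -1), and every index
 * above the removed position must be shifted down by one.  Repeating
 * this for every index list on every removal is the non-trivial cost
 * described above. */
void
fix_indexes_after_removal(int *indexes, size_t n, int removed)
{
    for (size_t i = 0; i < n; i++)
    {
        if (indexes[i] == removed)
            indexes[i] = -1;            /* dangling reference */
        else if (indexes[i] > removed)
            indexes[i]--;               /* positions above shift down */
    }
}
```

Keeping eq_sources and eq_derives inside the EquivalenceClass avoids this fix-up entirely: deleting the EC drops its clauses with it, and no external positional indexes need repair.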