RE: [PoC] Partition path cache
Hello!

> This sounds like an interesting idea. I like it because it omits the need
> for a "global statistics" effort for partitioned tables, since it just uses
> the first partition it knows. Of course, it has the drawback that the
> "first" partition can't represent the other partitions.

This method uses global statistics for all partitions. The cache uses the
standard path-building functions (they calculate selectivity for each path),
but it avoids calling all of them for the second and later partitions in a
group. The concept is similar to the GEQO method used for joins: we skip
creating some path variants when building all of them would take too long.

> One of the arguments against this patch might be: "What if other partitions
> have pretty different statistics from the first partition?". If I were you,
> I might check all the statistics used at this stage and try to find an
> algorithm to prove that the best path would be similar too. This can happen
> once, when the statistics are gathered. However, this might not be easy.

Yes, maybe we can split partitions into groups not only by their available
index lists but also by some ranges of statistical properties.

--
Best Regards
Ivan Bykov
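As a toy illustration of the grouping idea (entirely outside PostgreSQL; the
"signature" array and all names below are stand-ins for the patch's real
machinery, not its actual code):

#include <stdio.h>

#define NPARTS 6
#define NSIGS  16

int
main(void)
{
    /* signature stands in for "the set of usable indexes" of a partition */
    int signature[NPARTS] = {3, 3, 7, 3, 7, 1};
    int cache[NSIGS];
    int cached[NSIGS] = {0};

    for (int i = 0; i < NPARTS; i++)
    {
        int sig = signature[i];

        if (!cached[sig])
        {
            /* first member of the group: do the full (expensive) search */
            cache[sig] = sig * 100;   /* pretend this is the best path */
            cached[sig] = 1;
            printf("partition %d: built all paths, best=%d\n", i, cache[sig]);
        }
        else
        {
            /* later members: replay the cached choice cheaply */
            printf("partition %d: reused cached best=%d\n", i, cache[sig]);
        }
    }
    return 0;
}

Per-partition selectivity would still be computed in the "reuse" branch; only
the enumeration of path variants is skipped.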
Query ID Calculation Fix for DISTINCT / ORDER BY and LIMIT / OFFSET
Hi, all!

It seems I have found a bug in the Query ID calculation.

Problem
===
In some cases, we could get the same IDs for non-identical query trees. For
two structurally similar query subnodes, we may have query trees that look
like this:

QueryA->subNodeOne = Value X;
QueryA->subNodeTwo = NULL;

QueryB->subNodeOne = NULL;
QueryB->subNodeTwo = Value X;

When the query jumble walker calculates the query ID, it traverses the Query
members from top to bottom and generates the same IDs for these two queries,
because it does not change the hash value when it visits an empty node
(= NULL). Here is an example (the collection of jstate->jumble is omitted):

/* QueryA_ID = ... */
QueryA->subNodeOne = &(Value X);
/* QueryA_ID = hash_any_extended(&(Value X), sizeof(Value X), QueryA_ID) */
QueryA->subNodeTwo = NULL;
/* QueryA_ID unchanged */

/* QueryB_ID = ... */
QueryB->subNodeOne = NULL;
/* QueryB_ID unchanged */
QueryB->subNodeTwo = &(Value X);
/* QueryB_ID = hash_any_extended(&(Value X), sizeof(Value X), QueryB_ID) */

There are two pairs of subnodes that can trigger this error:
- distinctClause and sortClause (both contain a list of SortGroupClauses);
- limitOffset and limitCount (both contain an int8 expression).

Here is an example of such collisions (all queries were run on REL_17_0):

SET compute_query_id = on;

/* DISTINCT / ORDER BY */
EXPLAIN (VERBOSE) SELECT DISTINCT "oid" FROM pg_class;
/* Query Identifier: 751948508603549510 */
EXPLAIN (VERBOSE) SELECT "oid" FROM pg_class ORDER BY "oid";
/* Query Identifier: 751948508603549510 */

/* LIMIT / OFFSET */
EXPLAIN (VERBOSE) SELECT "oid" FROM pg_class LIMIT 1;
/* Query Identifier: 5185884322440896420 */
EXPLAIN (VERBOSE) SELECT "oid" FROM pg_class OFFSET 1;
/* Query Identifier: 5185884322440896420 */

Solution One
===
The simplest way to fix the problem is to place a scalar field that is used
in the query ID calculation between the similar subnodes. A patch for this
solution is attached (0001-Query-ID-Calculation-Fix-Variant-A.patch).
Here is an example:

/* QueryA_ID = ... */
QueryA->subNodeOne = &(Value X);
/* QueryA_ID = hash_any_extended(&(Value X), sizeof(Value X), QueryA_ID) */
QueryA->scalar = Value Y;
/* QueryA_ID = hash_any_extended(&(Value Y), sizeof(Value Y), QueryA_ID) */
QueryA->subNodeTwo = NULL;
/* QueryA_ID unchanged */

/* QueryB_ID = ... */
QueryB->subNodeOne = NULL;
/* QueryB_ID unchanged */
QueryB->scalar = Value Y;
/* QueryB_ID = hash_any_extended(&(Value Y), sizeof(Value Y), QueryB_ID) */
QueryB->subNodeTwo = &(Value X);
/* QueryB_ID = hash_any_extended(&(Value X), sizeof(Value X), QueryB_ID) */

Solution Two
===
Alternatively, we can change the hash value whenever we encounter an empty
node. This approach may impact performance, but it will protect us from such
errors in the future. A patch for this solution is attached
(0001-Query-ID-Calculation-Fix-Variant-B.patch). Here is an example:

/* QueryA_ID = ... */
QueryA->subNodeOne = &(Value X);
/* QueryA_ID = hash_any_extended(&(Value X), sizeof(Value X), QueryA_ID) */
QueryA->subNodeTwo = NULL;
/* QueryA_ID = hash_any_extended(&('\0'), sizeof(char), QueryA_ID) */

/* QueryB_ID = ... */
QueryB->subNodeOne = NULL;
/* QueryB_ID = hash_any_extended(&('\0'), sizeof(char), QueryB_ID) */
QueryB->subNodeTwo = &(Value X);
/* QueryB_ID = hash_any_extended(&(Value X), sizeof(Value X), QueryB_ID) */
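To make the collision mechanism concrete, here is a small standalone C
program; it is a sketch, not PostgreSQL code: FNV-1a stands in for
hash_any_extended(), and FakeQuery's string fields stand in for
limitOffset / limitCount.

#include <stdio.h>
#include <stdint.h>

typedef struct FakeQuery
{
    const char *subNodeOne;     /* stands in for limitOffset */
    const char *subNodeTwo;     /* stands in for limitCount */
} FakeQuery;

static uint64_t
jumble(const FakeQuery *q)
{
    uint64_t    h = 14695981039346656037ULL;    /* FNV-1a offset basis */
    const char *fields[] = {q->subNodeOne, q->subNodeTwo};

    for (int i = 0; i < 2; i++)
    {
        if (fields[i] == NULL)
            continue;           /* the bug: NULL leaves the hash untouched */
        for (const char *p = fields[i]; *p; p++)
        {
            h ^= (unsigned char) *p;
            h *= 1099511628211ULL;              /* FNV-1a prime */
        }
    }
    return h;
}

int
main(void)
{
    FakeQuery   a = {"X", NULL};    /* value in the first field */
    FakeQuery   b = {NULL, "X"};    /* same value, second field */

    /* Both lines print the same hash, mirroring the LIMIT/OFFSET case. */
    printf("A: %llu\nB: %llu\n",
           (unsigned long long) jumble(&a),
           (unsigned long long) jumble(&b));
    return 0;
}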
RE: Query ID Calculation Fix for DISTINCT / ORDER BY and LIMIT / OFFSET
Hello!

Last time, I forgot to attach the patches. The problem still persists in the
17.3 release.

Solution One
===
The simplest way to fix the problem is to place a scalar field that is used
in the query ID calculation between the similar subnodes. A patch for this
solution is attached (0001-Query-ID-Calculation-Fix-Variant-A.patch).

Solution Two
===
Alternatively, we can change the hash value whenever we encounter an empty
node. This approach may impact performance, but it will protect us from such
errors in the future. A patch for this solution is attached
(0001-Query-ID-Calculation-Fix-Variant-B.patch).

==
SELECT version();
                                    version
-------------------------------------------------------------------------------
 PostgreSQL 17.3 on x86_64-pc-linux-gnu, compiled by gcc (Debian 12.2.0-14) 12.2.0, 64-bit

SET compute_query_id = on;

/* LIMIT / OFFSET */
EXPLAIN (VERBOSE) SELECT "oid" FROM pg_class LIMIT 1;
                                  QUERY PLAN
------------------------------------------------------------------------------
 Limit  (cost=0.00..0.04 rows=1 width=4)
   Output: oid
   ->  Seq Scan on pg_catalog.pg_class  (cost=0.00..18.15 rows=415 width=4)
         Output: oid
 Query Identifier: 5185884322440896420

EXPLAIN (VERBOSE) SELECT "oid" FROM pg_class OFFSET 1;
                                  QUERY PLAN
------------------------------------------------------------------------------
 Limit  (cost=0.04..18.15 rows=414 width=4)
   Output: oid
   ->  Seq Scan on pg_catalog.pg_class  (cost=0.00..18.15 rows=415 width=4)
         Output: oid
 Query Identifier: 5185884322440896420

/* DISTINCT / ORDER BY */
EXPLAIN (VERBOSE) SELECT DISTINCT "oid" FROM pg_class;
                                  QUERY PLAN
------------------------------------------------------------------------------
 Unique  (cost=0.27..23.54 rows=415 width=4)
   Output: oid
   ->  Index Only Scan using pg_class_oid_index on pg_catalog.pg_class  (cost=0.27..22.50 rows=415 width=4)
         Output: oid
 Query Identifier: 751948508603549510

EXPLAIN (VERBOSE) SELECT "oid" FROM pg_class ORDER BY "oid";
                                  QUERY PLAN
------------------------------------------------------------------------------
 Index Only Scan using pg_class_oid_index on pg_catalog.pg_class  (cost=0.27..22.50 rows=415 width=4)
   Output: oid
 Query Identifier: 751948508603549510

Attachments:
0001-Query-ID-Calculation-Fix-Variant-A.patch
0001-Query-ID-Calculation-Fix-Variant-B.patch
RE: Query ID Calculation Fix for DISTINCT / ORDER BY and LIMIT / OFFSET
Here are 0001-Query-ID-Calculation-Fix-Variant-A.patch and
0001-Query-ID-Calculation-Fix-Variant-B.patch.

Attachments:
0001-Query-ID-Calculation-Fix-Variant-A.patch
0001-Query-ID-Calculation-Fix-Variant-B.patch
RE: Query ID Calculation Fix for DISTINCT / ORDER BY and LIMIT / OFFSET
Hello!

>> Variant B is not acceptable IMO as it adds a whole bunch of
>> null-terminators unnecessarily. For example, in a simple "select 1",
>> (expr == NULL) is true 19 times, so that is an extra 19 bytes.
> Variant B is not acceptable here.

Could we improve Variant B? I was thinking about adding a count of the NULL
pointers encountered by the query jumble walker since the last scalar or
non-NULL field visited. This way, we can traverse the query as follows:

QueryA->subNodeOne = &(Value X);
/* JumbleState->Count = 0;
   QueryA_ID = hash_any_extended(&(Value X), sizeof(Value X), QueryA_ID) */
QueryA->subNodeTwo = NULL;
/* JumbleState->Count = 1; */
QueryA->subNodeThree = NULL;
/* JumbleState->Count = 2; */
QueryA->subNodeFour = NULL;
/* JumbleState->Count = 3; */
QueryA->scalar = Value Y;
/* QueryA_ID = hash_any_extended(&(JumbleState->Count),
                                 sizeof(JumbleState->Count), QueryA_ID);
   JumbleState->Count = 0;
   QueryA_ID = hash_any_extended(&(Value Y), sizeof(Value Y), QueryA_ID) */

Variant A improvement
===
I have attached the updated Variant A patch with the following changes:
- Implemented a pg_stat_statements regression test (see similar test results
  on REL_17_3 in Query-ID-collision-at_pg_stat_statements-1ea6e890b22.txt).
- Added an extra description of the Query ID collision in
  src/include/nodes/parsenodes.h.

-- End of contrib/pg_stat_statements/expected/select.out for REL_17_3

CREATE TABLE pgss_c AS SELECT id FROM generate_series(1,2) AS id;
SELECT pg_stat_statements_reset() IS NOT NULL AS t;
 t
---
 t
(1 row)

SELECT DISTINCT id FROM pgss_c;
 id
----
  2
  1
(2 rows)

SELECT id FROM pgss_c ORDER BY id;
 id
----
  1
  2
(2 rows)

SELECT id FROM pgss_c LIMIT 1;
 id
----
  1
(1 row)

SELECT id FROM pgss_c OFFSET 1;
 id
----
  2
(1 row)

SELECT query FROM pg_stat_statements WHERE query LIKE '%pgss_c%' ORDER BY query COLLATE "C";
             query
--------------------------------
 SELECT DISTINCT id FROM pgss_c
 SELECT id FROM pgss_c LIMIT $1
(2 rows)

SELECT pg_stat_statements_reset() IS NOT NULL AS t;
 t
---
 t
(1 row)

SELECT id FROM pgss_c ORDER BY id;
 id
----
  1
  2
(2 rows)

SELECT DISTINCT id FROM pgss_c;
 id
----
  2
  1
(2 rows)

SELECT id FROM pgss_c OFFSET 1;
 id
----
  2
(1 row)

SELECT id FROM pgss_c LIMIT 1;
 id
----
  1
(1 row)

SELECT query FROM pg_stat_statements WHERE query LIKE '%pgss_c%' ORDER BY query COLLATE "C";
               query
-----------------------------------
 SELECT id FROM pgss_c OFFSET $1
 SELECT id FROM pgss_c ORDER BY id
(2 rows)

SELECT pg_stat_statements_reset() IS NOT NULL AS t;
 t
---
 t
(1 row)

DROP TABLE pgss_c;

Attachments:
v2-0001-Query-ID-Calculation-Fix-Variant-A.patch
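For illustration, here is a minimal standalone sketch of this counting
scheme. The names (JumbleSketch, append_datum, visit_node) are hypothetical,
and buffer-overflow handling and the hash flush are omitted; this is not the
patch's actual code.

#include <stddef.h>
#include <string.h>

typedef struct JumbleSketch
{
    unsigned char buf[1024];        /* jumble buffer */
    size_t        len;              /* bytes used */
    unsigned int  pending_nulls;    /* NULLs seen since last non-NULL datum */
} JumbleSketch;

static void
append_datum(JumbleSketch *js, const void *data, size_t size)
{
    if (js->pending_nulls > 0)
    {
        /* flush the counter's low byte before the next real datum */
        js->buf[js->len++] = (unsigned char) js->pending_nulls;
        js->pending_nulls = 0;
    }
    memcpy(js->buf + js->len, data, size);
    js->len += size;
}

static void
visit_node(JumbleSketch *js, const void *node, size_t size)
{
    if (node == NULL)
        js->pending_nulls++;        /* just count; no buffer traffic */
    else
        append_datum(js, node, size);
}

A run of consecutive NULL nodes thus costs one byte in the buffer instead of
one byte each, while still distinguishing "X, NULL" from "NULL, X".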
RE: Query ID Calculation Fix for DISTINCT / ORDER BY and LIMIT / OFFSET
Here is the bug description from
https://www.postgresql.org/message-id/flat/ca447b72d15745b9a877fad7e258407a%40localhost.localdomain
(the same Problem / Solution One / Solution Two write-up as in the first
message of this thread).
RE: Query ID Calculation Fix for DISTINCT / ORDER BY and LIMIT / OFFSET
Hello!

> Since then, I see that Ivan has already submitted a patch that accounts
> for NULL nodes and adds a byte to the jumble buffer to account for NULLs.
> This seems quite clean and simple. However, Sami seems to have concerns
> about the overhead of doing this. Is that warranted at all? Potentially,
> there could be a decent number of NULL fields. It'll probably be much
> cheaper than the offsetof idea I came up with.

To address the concern that NULL marks consume a lot of bytes in the jumble
buffer, I have added a new patch version for Variant B
(v2-0001-Query-ID-Calculation-Fix-Variant-B.patch). This patch adds to
JumbleState a count of the NULL nodes visited before a non-NULL one appears.
Whenever the count is non-zero (indicating that we have visited NULL nodes
since the last datum) and a new chunk of data is about to be appended to the
jumble buffer, the least significant byte of the count is appended first.

I intentionally did not add a final append of the NULL count in the
JumbleQuery processing, as trailing NULL nodes of the query do not improve
the query ID's resistance to collisions.

I believe that my first version of the Variant B patch is simple and easy to
understand. I would prefer to use it, so I have attached an updated version
along with tests (v3-0001-Query-ID-Calculation-Fix-Variant-B.patch). As you
can see, v2-0001-Query-ID-Calculation-Fix-Variant-B.patch is more complex
and invasive than v3-0001-Query-ID-Calculation-Fix-Variant-B.patch. I don't
think that saving a few bytes in a 1024-byte buffer justifies even this
simple optimization (v2).

> Can you share your thoughts on Ivan's NULL jumble idea?

I would appreciate any feedback. Thank you.

-----Original Message-----
From: David Rowley
Sent: Tuesday, March 11, 2025 12:49 PM
To: Michael Paquier
Cc: Быков Иван Александрович; Sami Imseih; pgsql-hack...@postgresql.org
Subject: Re: Query ID Calculation Fix for DISTINCT / ORDER BY and LIMIT / OFFSET

On Tue, 11 Mar 2025 at 19:50, Michael Paquier wrote:
> On Tue, Mar 11, 2025 at 12:45:35AM +1300, David Rowley wrote:
> > It seems to me that if this fixes the issue, the next similar one is
> > already lurking in the shadows waiting to jump out on us.
>
> For how many years will we have to wait for this threat hidden in the
> shadows? :)
>
> This issue has existed since query jumbling was introduced in pgss, so it
> goes really far back. I've spent a bit more than a few minutes on that.

I didn't mean to cause offence here. I just wanted to make it clear that I
don't think fixing these types of issues by swapping the order of the fields
is going to be a sustainable practice, and it would be good to come up with
a solution that eliminates the chances of this class of bug from ever
appearing again. Even if we were to trawl the entire Query struct and
everything that could ever be attached to it and deem that today it's fine
and no more such bugs exist, the person adding some new SQL feature in the
future that adds new data to store in the Query probably isn't going to give
query jumbling much thought, especially now that the code generation is
automatic.

> > Couldn't we adjust all this code so that we pass a seed to
> > AppendJumble() that's the offsetof(Struct, field) that we're jumbling
> > and incorporate that seed into the hash? I don't think we could
> > possibly change the offset in a minor version, so I don't think
> > there's a risk that could cause jumble value instability.
> > Possibly different CPU architectures would come up with different
> > offsets through having different struct alignment requirements, but
> > we don't make any guarantees about the jumble values matching across
> > different CPU architectures anyway.
>
> Yeah, something like that has the potential to get rid of such problems
> forever, particularly thanks to the fact that we automate the generation
> of this code now, so it is mostly cost-free. This has a cost for the
> custom jumbling functions, where one needs some imagination, but with
> models being around while we keep the number of custom functions low,
> that should be neither complicated nor costly, at least in my experience.

I hadn't really processed this thread fully when I responded yesterday, and
my response was mostly aimed at the latest patch on the thread at the time,
which I had looked at quickly; I wanted to put a stop to changing field
orders as a fix for this. Since then, I see that Ivan has already submitted
a patch that accounts for NULL nodes and adds a byte to the jumble buffer to
account for NULLs. This seems quite clean and simple. However, Sami seems to
have concerns about the overhead of doing this. Is that warranted at all?
Potentially, there could be a decent number of NULL fields. It'll probably
be much cheaper than the offsetof idea I came up with.

> When I was thinking about the addition
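For clarity, here is a rough sketch of what the quoted offsetof-as-seed idea
could look like. The function and the FNV-style mixing are assumptions for
illustration; this is not the actual AppendJumble() code.

#include <stddef.h>
#include <stdint.h>

static uint64_t
jumble_field(uint64_t h, const void *data, size_t size, size_t field_offset)
{
    const unsigned char *p = data;

    h ^= (uint64_t) field_offset;   /* the per-field seed */
    h *= 1099511628211ULL;

    for (size_t i = 0; i < size; i++)
    {
        h ^= p[i];
        h *= 1099511628211ULL;      /* FNV-1a-style mixing */
    }
    return h;
}

/* Usage (hypothetical):
 *   h = jumble_field(h, expr, exprsize, offsetof(Query, limitOffset));
 * versus
 *   h = jumble_field(h, expr, exprsize, offsetof(Query, limitCount));
 * now yield different hashes for the same expr, even when the sibling
 * field is NULL and never jumbled at all.
 */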
RE: Query ID Calculation Fix for DISTINCT / ORDER BY and LIMIT / OFFSET
Hello, David!

As I can see, your patch has the same idea as my
v2-0001-Query-ID-Calculation-Fix-Variant-B.patch from [1].

I think it would be better to extract the jumble buffer update together with
the hash calculation into a function (CompressJumble in my patch). This
would result in fewer changes to the source code. We can also simply dump
the NULL count to the buffer when we encounter a non-NULL field (which will
be processed in AppendJumble, indeed).

What do you think about my patch
(v2-0001-Query-ID-Calculation-Fix-Variant-B.patch)?

[1] https://www.postgresql.org/message-id/5ac172e0b77a4baba50671cd1a15285f%40localhost.localdomain
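A minimal sketch of what such an extracted helper might look like. The
struct layout below is assumed for illustration; hash_any_extended() is
PostgreSQL's real hashing routine, but this is not the actual JumbleState or
patch code.

#include <stddef.h>
#include <stdint.h>

typedef struct JumbleStateSketch
{
    unsigned char *jumble;          /* accumulation buffer */
    size_t         jumble_len;      /* bytes used so far */
    uint64_t       hash;            /* running hash value */
} JumbleStateSketch;

static void
CompressJumble(JumbleStateSketch *jstate)
{
    /* fold everything accumulated so far into the running 64-bit hash */
    jstate->hash = hash_any_extended(jstate->jumble,
                                     jstate->jumble_len,
                                     jstate->hash);
    jstate->jumble_len = 0;         /* the buffer can be refilled */
}

Every append site can then simply call CompressJumble() when the buffer is
full, instead of open-coding the flush sequence.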
RE: Query ID Calculation Fix for DISTINCT / ORDER BY and LIMIT / OFFSET
Hello, Michael!

> So, here is attached a counter-proposal, where we can simply add a
> counter tracking a node count in _jumbleNode() to add more entropy to
> the mix, incrementing it as well for NULL nodes.

It definitely looks like a more reliable solution than my variant, which
only counts NULL nodes. However, we already know about the overhead of
adding '\0' bytes for every NULL field.

> So that adds about 9.1% overhead to jumbling, on average.

See:
https://www.postgresql.org/message-id/flat/5ac172e0b77a4baba50671cd1a15285f%40localhost.localdomain#6c43f354f5f42d2a27e6824faa660a86

Is it really worth spending extra execution time to increase entropy when we
visit non-NULL nodes? Maybe we should add node_count to the hash only when
we visit NULL nodes, or only when we visit non-NULL nodes. We could also add
entropy when we see a change in the node->type value for the non-NULL
variants. (Below, < ... > means the value is appended to the jumble, and
/* ... */ means it is not.)

Your Variant
< node_count = 1 > < node 1 >
< node_count = 2 > /* node 2 = NULL */
< node_count = 3 > < node 3 >

Alternative 1 (mark only NULL nodes)
/* node_count = 1 */ < node 1 >
< node_count = 2 > /* node 2 = NULL */
/* node_count = 3 */ < node 3 >

Alternative 2 (mark only non-NULL nodes)
This could address concerns about visiting nodes with the same content
placed in different branches of the query tree.
< node_count = 1 > < node 1 >
/* node_count = 2 */ /* node 2 = NULL */
< node_count = 3 > < node 3 >
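Sketched in code, the three policies differ only in when the counter is
appended. All names below are hypothetical; append() and jumble_fields() are
declared as stubs only.

#include <stddef.h>

typedef struct SketchState
{
    unsigned int node_count;
} SketchState;

typedef enum
{
    COUNT_ALWAYS,               /* your variant */
    COUNT_NULL_ONLY,            /* Alternative 1 */
    COUNT_NONNULL_ONLY          /* Alternative 2 */
} CountPolicy;

static void append(SketchState *st, const void *data, size_t size);
static void jumble_fields(SketchState *st, const void *node);

static void
visit(SketchState *st, const void *node, CountPolicy policy)
{
    st->node_count++;

    if (policy == COUNT_ALWAYS ||
        (policy == COUNT_NULL_ONLY && node == NULL) ||
        (policy == COUNT_NONNULL_ONLY && node != NULL))
        append(st, &st->node_count, sizeof(st->node_count));

    if (node != NULL)
        jumble_fields(st, node);    /* recurse into the node's fields */
}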
RE: Query ID Calculation Fix for DISTINCT / ORDER BY and LIMIT / OFFSET
Hello, David!

> I see you opted to only jumble the low-order byte of the pending null
> counter. I used the full 4 bytes as I don't think the extra 3 bytes is
> a problem. With v7 the memcpy to copy the pending_nulls into the
> buffer is inlined into a single instruction. It's semi-conceivable
> that you could have more than 255 NULLs given that a List can contain
> Nodes. I don't know if we ever have any NULL items in Lists at the
> moment, but I don't think that that's disallowed. I'd feel much safer
> taking the full 4 bytes of the counter to help prevent future
> surprises.

Yes, _jumbleList could theoretically handle cases with over 256 Node
elements in the same list, but most (or all) of them are non-NULL (as far as
I know, only PlannedStmt->subplans accepts NULL elements). Most of the
foreach cycles over query lists don't have any NULL-safety checks; we just
assume that the lists contain no NULL pointers. I still believe storing just
one byte is sufficient.

> I'm happy to proceed with the v7 version. I'm also happy to credit you
> as the primary author of it, given that you came up with the v2b patch.
> It might be best to extract the performance stuff that's in v7 and
> apply that separately from the bug fix. If you're still interested and
> have time in the next 7 hours to do it, would you be able to split the
> v7 as described, making v8-0001 the bug fix and v8-0002 the performance
> stuff? If so, could you also add the total_jumble_len to v8-0001 like
> Michael's last patch had, but adjust it so it's only maintained for
> Assert builds? Michael is quite keen on having more assurance that
> custom jumble functions don't decide to forego any jumbling.

As I can see, you have already split the patches. I apologize for the delay;
I don't have much time to work on this issue.

Thank you for your proposal to credit me as the patch author.
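For reference, the two encodings side by side, reusing the hypothetical
JumbleSketch struct from the sketch upthread (buf, len, pending_nulls); this
is illustrative, not the v7 or v2b patch code. With the full counter, runs
of more than 255 consecutive NULLs still jumble distinctly; with one byte,
counts differing by a multiple of 256 would collide, however unlikely that
is in practice.

#include <stdint.h>
#include <string.h>

static void
flush_pending_nulls_v7(JumbleSketch *js)    /* full 4 bytes */
{
    uint32_t pending = js->pending_nulls;

    memcpy(js->buf + js->len, &pending, sizeof(pending));
    js->len += sizeof(pending);
    js->pending_nulls = 0;
}

static void
flush_pending_nulls_1byte(JumbleSketch *js) /* low-order byte only */
{
    js->buf[js->len++] = (unsigned char) js->pending_nulls;
    js->pending_nulls = 0;
}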