Re: Logging which local address was connected to in log_line_prefix
On 5/24/24 22:33, Greg Sabino Mullane wrote:
> Peter, thank you for the feedback. Attached is a new patch with "address"
> rather than "interface", plus a new default of "local" if there is no
> address. I also removed the questionable comment, and updated the
> commitfest title.

I tried the updated patch and it behaved as expected with [local] being
logged for peer connections and an IP being logged for host connections.

One thing -- the changes in postgresql.conf.sample should use tabs to match
the other lines. The patch uses spaces. I also find the formatting in
log_status_format() pretty awkward but I guess that will be handled by
pgindent.

Regards,
-David
Address the -Wuse-after-free warning in ATExecAttachPartition()
In [1], Andres reported a -Wuse-after-free bug in the ATExecAttachPartition()
function. I've created a patch to address it with pointers from Amit offlist.

The issue was that the partBoundConstraint variable was used after the
list_concat() call. This could potentially lead to accessing the
partBoundConstraint variable after its memory has been freed.

The issue was resolved by using the return value of the list_concat()
function, instead of using the list1 argument of list_concat(). I copied the
partBoundConstraint variable to a new variable named partConstraint and used
it for the previous references before invoking
get_proposed_default_constraint(). I confirmed that the
eval_const_expressions(), make_ands_explicit(), map_partition_varattnos(), and
QueuePartitionConstraintValidation() functions do not modify the memory
location pointed to by the partBoundConstraint variable. Therefore, it is safe
to use it for the next reference in get_proposed_default_constraint().

Attaching the patch. Please review and share comments, if any. Thanks to
Andres for spotting the bug and for some off-list advice on how to reproduce
it.

[1]: https://www.postgresql.org/message-id/flat/202311151802.ngj2la66jwgi%40alvherre.pgsql#4fc5622772ba0244c1ad203f5fc56701

Best Regards,
Nitin Jadhav
Azure Database for PostgreSQL
Microsoft

v1-0001-Address-the-Wuse-after-free-warning-in-ATExecAttachP.patch
Description: Binary data
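To see the hazard in isolation, here is a minimal, self-contained C sketch
using plain realloc() (illustrative only; it is not code from the patch, but
list_concat() is analogous in that it may free and move the memory behind its
first argument):

#include <stdlib.h>
#include <string.h>
#include <stdio.h>

int
main(void)
{
	char	   *buf = malloc(6);
	char	   *grown;

	if (buf == NULL)
		return 1;
	memcpy(buf, "hello", 6);

	/*
	 * realloc() may free 'buf' and return a different block.  Reading 'buf'
	 * afterwards is the classic -Wuse-after-free pattern; the fix is to use
	 * only the returned pointer, just as the patch switches from the
	 * list_concat() argument to its return value.
	 */
	grown = realloc(buf, 12);
	if (grown == NULL)
	{
		free(buf);
		return 1;
	}
	strcat(grown, " world");	/* OK: use 'grown', never 'buf', from here on */
	puts(grown);

	free(grown);
	return 0;
}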
Re: Pluggable cumulative statistics
Hi,

On Mon, Jul 08, 2024 at 03:49:34PM +0900, Michael Paquier wrote:
> On Mon, Jul 08, 2024 at 06:39:56AM +, Bertrand Drouvot wrote:
> > + for (int kind = PGSTAT_KIND_FIRST_VALID; kind <= PGSTAT_KIND_LAST; kind++)
> > + {
> > + char *ptr;
> > + const PgStat_KindInfo *info = pgstat_get_kind_info(kind);
> >
> > I wonder if we could avoid going through stats that are not fixed ones.
> > What about doing something like? Same would apply for the read part
> > added in 9004abf6206e.
>
> This becomes more relevant when the custom stats are added, as this
> performs a scan across the full range of IDs supported. So this
> choice is here for consistency, and to ease the pluggability.

Gotcha.

> > 2 ===
> >
> > + pgstat_build_snapshot_fixed(kind);
> > + ptr = ((char *) &pgStatLocal.snapshot) + info->snapshot_ctl_off;
> > + write_chunk(fpout, ptr, info->shared_data_len);
> >
> > I think that using "shared_data_len" is confusing here (was not the case
> > in the context of 9004abf6206e). I mean it is perfectly correct but the
> > wording "shared" looks weird to me when being used here. What it really
> > is, is the size of the stats. What about renaming shared_data_len with
> > stats_data_len?
>
> It is the stats data associated to a shared entry. I think that's OK,
> but perhaps I'm just used to it as I've been staring at this area for
> days.

Yeah, what I meant to say is that one could think for example that's the
PgStatShared_Archiver size while in fact it's the PgStat_ArchiverStats size.

I think it's more confusing when writing the stats. Here we are manipulating
"snapshot" and "snapshot" offsets. It was not that confusing when reading as
we are manipulating "shmem" and "shared" offsets.

As I said, the code is fully correct, that's just the wording here that
sounds weird to me in the "snapshot" context.

Except the above (which is just a Nit), 0001 LGTM.

Regards,

--
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com
Re: Re: speed up pg_upgrade with large number of tables
> Thanks! Since you mentioned that you have multiple databases with 1M+
> databases, you might also be interested in commit 2329cad. That should
> speed up the pg_dump step quite a bit.

Wow, I noticed this commit (2329cad) when it appeared in the commitfest. It
has doubled the speed of pg_dump in this scenario. Thank you for your effort!

Besides, https://commitfest.postgresql.org/48/4995/ seems insufficient for
this situation. Some time-consuming functions like check_for_data_types_usage
are not yet able to run in parallel. But these patches could be a great
starting point for a more efficient parallelism implementation. Maybe we can
do it later.
Re: Doc Rework: Section 9.16.13 SQL/JSON Query Functions
Another tiny issue:

-select json_query(jsonb '{"a": "[1, 2]"}', 'lax $.a' OMIT QUOTES);
+JSON_QUERY(jsonb '{"a": "[1, 2]"}', 'lax $.a' OMIT QUOTES);
 [1, 2]

-select json_query(jsonb '{"a": "[1, 2]"}', 'lax $.a' RETURNING int[] OMIT QUOTES ERROR ON ERROR);
+JSON_QUERY(jsonb '{"a": "[1, 2]"}', 'lax $.a' RETURNING int[] OMIT QUOTES ERROR ON ERROR);

These two example queries don't need semicolons at the end?
Re: Pgoutput not capturing the generated columns
Hi Shlok,

Here are some review comments for patch v15-0003.

==
src/backend/catalog/pg_publication.c

1. publication_translate_columns

The function comment says:

 * Translate a list of column names to an array of attribute numbers
 * and a Bitmapset with them; verify that each attribute is appropriate
 * to have in a publication column list (no system or generated attributes,
 * no duplicates). Additional checks with replica identity are done later;
 * see pub_collist_contains_invalid_column.

That part about "[no] generated attributes" seems to have gone stale -- e.g.
not quite correct anymore. Should it say no VIRTUAL generated attributes?

==
src/backend/replication/logical/proto.c

2. logicalrep_write_tuple and logicalrep_write_attrs

I thought all the code fragments like this:

+ if (att->attgenerated && att->attgenerated != ATTRIBUTE_GENERATED_STORED)
+ continue;
+

don't need to be in the code anymore, because of the Bitmapset (BMS)
processing done to make the "column list" for the publication, where
disallowed generated columns should already be excluded from the BMS, right?
So shouldn't all these be detected by the following statement:

if (!column_in_column_list(att->attnum, columns))
 continue;

==
src/backend/replication/logical/tablesync.c

3.

+ if(server_version >= 12)
+ {
+ bool gencols_allowed = server_version >= 17 && MySubscription->includegencols;
+
+ if (gencols_allowed)
+ {

Should say server_version >= 18, instead of 17.

==
src/backend/replication/pgoutput/pgoutput.c

4. send_relation_and_attrs

(this is a similar comment to #2 above)

IIUC, one of the advantages of the Bitmapset (BMS) idea in patch 0001, which
processes the generated columns up-front, is that there is no need to check
them again in code like this. They should be discovered anyway in the
subsequent check:

/* Skip this attribute if it's not present in the column list */
if (columns != NULL && !bms_is_member(att->attnum, columns))
 continue;

==
src/test/subscription/t/011_generated.pl

5.

AFAICT there are still multiple comments (e.g. for the "TEST tab" comments)
where it still says "generated" instead of "stored generated". I did not make
a "nitpicks" diff for these because those comments are inherited from the
prior patch 0002, which still has outstanding review comments on it too.
Please just search/replace them.

==
Kind Regards,
Peter Smith.
Fujitsu Australia
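To make the point about up-front filtering concrete, here is a small,
self-contained C sketch (a plain uint64 bitmask standing in for PostgreSQL's
Bitmapset, with made-up attribute data; illustrative only): once disallowed
columns are excluded while the column list is built, the membership test
alone covers them in the send paths.

#include <stdint.h>
#include <stdio.h>
#include <stdbool.h>

/* Stand-in for a relation's attributes: attnum plus whether the column is
 * disallowed (e.g. a virtual generated column).  Purely illustrative data. */
typedef struct
{
	int		attnum;
	bool	disallowed;
} AttrExample;

static bool
in_column_list(uint64_t columns, int attnum)
{
	return (columns & (UINT64_C(1) << attnum)) != 0;
}

int
main(void)
{
	AttrExample attrs[] = {{1, false}, {2, true}, {3, false}};
	uint64_t	columns = 0;

	/* Build the "column list" up front, excluding disallowed columns ... */
	for (size_t i = 0; i < sizeof(attrs) / sizeof(attrs[0]); i++)
		if (!attrs[i].disallowed)
			columns |= UINT64_C(1) << attrs[i].attnum;

	/* ... so the send path only needs the membership check. */
	for (size_t i = 0; i < sizeof(attrs) / sizeof(attrs[0]); i++)
	{
		if (!in_column_list(columns, attrs[i].attnum))
			continue;			/* also covers the excluded generated column */
		printf("would send attnum %d\n", attrs[i].attnum);
	}
	return 0;
}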
Re: Parallel CREATE INDEX for GIN indexes
On Sun, 7 Jul 2024, 18:11 Tomas Vondra, wrote:
>
> On 7/3/24 20:36, Matthias van de Meent wrote:
>> On Mon, 24 Jun 2024 at 02:58, Tomas Vondra wrote:
>>> So I've switched this to use the regular data-type comparisons, with
>>> SortSupport etc. There's a bit more cleanup remaining and testing
>>> needed, but I'm not aware of any bugs.
>>
>> A review of patch 0001:
>>
>> ---
>>
>>> src/backend/access/gin/gininsert.c | 1449 +++-
>>
>> The nbtree code has `nbtsort.c` for its sort- and (parallel) build
>> state handling, which is exclusively used during index creation. As
>> the changes here seem to be largely related to bulk insertion, how
>> much effort would it be to split the bulk insertion code path into a
>> separate file?
>>
>
> Hmmm. I haven't tried doing that, but I guess it's doable. I assume we'd
> want to do the move first, because it involves pre-existing code, and
> then do the patch on top of that.
>
> But what would be the benefit of doing that? I'm not sure doing it just
> to make it look more like the btree code is really worth it. Do you expect
> the result to be clearer?

It was mostly a consideration of file size and separation of concerns. The
sorted build path is quite different from the unsorted build, after all.

>> I noticed that new fields in GinBuildState do get to have a
>> bs_*-prefix, but none of the other added or previous fields of the
>> modified structs in gininsert.c have such prefixes. Could this be
>> unified?
>>
>
> Yeah, these are inconsistencies from copying the infrastructure code to
> make the parallel builds work, etc.
>
>>> +/* Magic numbers for parallel state sharing */
>>> +#define PARALLEL_KEY_GIN_SHARED		UINT64CONST(0xB001)
>>> ...
>>
>> These overlap with BRIN's keys; can we make them unique while we're at it?
>>
>
> We could, and I recall we had a similar discussion in the parallel BRIN
> thread, right? But I'm somewhat unsure why we would actually want/need
> these keys to be unique. Surely we don't need to mix those keys in a
> single shm segment, right? So it seems more like an aesthetic thing. Or
> is there some policy to have unique values for these keys?

Uniqueness would be mostly useful for debugging shared memory issues, but
indeed, in a correctly working system we wouldn't have to worry about
parallel state key type confusion.

>> ---
>>
>>> +++ b/src/include/access/gin_tuple.h
>>> + typedef struct GinTuple
>>
>> I think this needs some more care: currently, each GinTuple is at
>> least 36 bytes in size on 64-bit systems. By using int instead of Size
>> (no normal indexable tuple can be larger than MaxAllocSize), and
>> packing the fields better we can shave off 10 bytes; or 12 bytes if
>> GinTuple.keylen is further adjusted to (u)int16: a key needs to fit on
>> a page, so we can probably safely assume that the key size fits in
>> (u)int16.
>
> Yeah, I guess using int64 is a bit excessive - you're right about that.
> I'm not sure this is necessarily about "indexable tuples" (GinTuple is
> not indexed, it's more an intermediate representation).

Yes, but even if the GinTuple itself isn't stored on disk in the index, a
GinTuple's key *is* part of the primary GIN btree's keys somewhere down the
line, and thus must fit on a page somewhere. That's what I was referring to.

> But if we can make it smaller, that probably can't hurt.
>
> I don't have a great intuition on how beneficial this might be. For
> cases with many TIDs per index key, it probably won't matter much. But
> if there's many keys (so that GinTuples store only very few TIDs), it
> might make a difference.

This will indeed matter most when small TID lists are common. I suspect it's
not uncommon to find ourselves in such a situation when users have a low
maintenance_work_mem (and thus don't have much space to buffer and combine
index tuples before they're flushed), or when the generated index keys can't
be stored in the available memory (such as in my example; I didn't tune
m_w_m at all, and the table I tested had ~15GB of data).

> I'll try to measure the impact on the same "realistic" cases I used for
> the earlier steps.

That would be greatly appreciated, thanks!

Kind regards,

Matthias van de Meent
Neon (https://neon.tech)
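As a rough, self-contained illustration of the kind of packing being
discussed (hypothetical field names and layout; this is not the actual
GinTuple from gin_tuple.h):

#include <stdint.h>
#include <stdbool.h>
#include <stdio.h>

/*
 * Sketch of a packed intermediate-tuple header: using 16/32-bit fields where
 * the value range allows, and ordering members from widest to narrowest,
 * avoids padding holes on 64-bit systems.
 */
typedef struct PackedGinTupleSketch
{
	int32_t		tuplen;		/* total tuple length; fits in 32 bits */
	uint16_t	attrnum;	/* index attribute number */
	uint16_t	keylen;		/* key must fit on a page, so 16 bits suffice */
	int16_t		typlen;		/* key datatype length */
	bool		typbyval;	/* key datatype passed by value? */
	/* variable-length key bytes and a compressed TID list would follow */
} PackedGinTupleSketch;

int
main(void)
{
	/* On typical 64-bit platforms this prints 12, versus the 36+ bytes
	 * mentioned above for the current layout. */
	printf("sizeof(PackedGinTupleSketch) = %zu\n",
		   sizeof(PackedGinTupleSketch));
	return 0;
}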
a potential typo in comments of pg_parse_json
Not 100% sure, sorry if this doesn't make sense.

--- a/src/common/jsonapi.c
+++ b/src/common/jsonapi.c
@@ -514,7 +514,7 @@ freeJsonLexContext(JsonLexContext *lex)
  *
  * If FORCE_JSON_PSTACK is defined then the routine will call the non-recursive
  * JSON parser. This is a useful way to validate that it's doing the right
- * think at least for non-incremental cases. If this is on we expect to see
+ * thing at least for non-incremental cases. If this is on we expect to see
  * regression diffs relating to error messages about stack depth, but no
  * other differences.
  */

--
Regards
Junwang Zhao
Re: Partial aggregates pushdown
On Sun, 7 Jul 2024 at 23:46, fujii.y...@df.mitsubishielectric.co.jp wrote:
> In my mind, Approach1 is superior. Therefore, if there are no objections this
> week, I plan to resume implementing Approach1 next week. I would appreciate
> it if anyone could discuss the topic with me or ask questions.

Honestly, the more I think about this, the more I like Approach2. Not because
I disagree with you about some of the limitations of Approach2, but because
I'd rather see those limitations fixed in CREATE TYPE, instead of working
around these limitations in CREATE AGGREGATE. That way more usages can
benefit. Detailed explanation and arguments below.

> 1. Extendability
> I believe it is crucial to support scenarios where the local and remote major
> versions may differ in the future (see the below).
>
> https://www.postgresql.org/message-id/4012625.1701120204%40sss.pgh.pa.us

From my reading, Tom's concern is that different server versions cannot talk
to each other anymore. So as long as this perf optimization is only enabled
when server versions are the same, I don't think there is a huge problem if
we never implement this. Honestly, I even think making this optimization
opt-in at the FDW server creation level would already solve Tom's concern.

I do agree that it would be good if we could have cross-version partial
aggregates though, so it's definitely something to consider.

> Regarding this aspect, I consider Approach1 superior to Approach2. The reason
> is that:
> ・The data type of an aggregate function's state value may change with each
> major version increment.
> ・In Approach1, by extending the export/import functionalities to include the
> major version in which the state value was created (refer to p.16 and p.17 of
> [1]), I can handle such situations.
> ・On the other hand, it appears that Approach2 fundamentally lacks the
> capability to support these scenarios.

Approach 2 definitely has some cross-version capabilities, e.g. jsonb_send
includes a version. Such an approach can be used to solve a newer coordinator
talking to an older worker, if the transtypes are the same.

I personally don't think it's worth supporting this optimization for an older
coordinator talking to a newer worker. Using the binary protocol to talk from
an older server to a newer server doesn't work either.

Finally, based on p.16 & p.17 it's unclear to me how cross-version with
different transtypes would work. That situation seems inherently incompatible
to me.

> 2. Amount of codes
> Regarding this aspect, I find Approach1 to be better than Approach2.
> In Approach1, developers only need to export/import functions and can use a
> standardized format for transmitting state values.
> In Approach2, developers have two options:
> Option1: Adding typinput/typoutput and typsend/typreceive.
> Option2: Adding typinput/typoutput only.
> Option1 requires more lines of code, which may be seen as cumbersome by some
> developers.
> Option2 restricts developers to using only text representation for
> transmitting state values, which I consider limiting.

In my opinion this is your strongest argument for Approach1. But you didn't
answer my previous two questions yet:

On Tue, 25 Jun 2024 at 11:28, Jelte Fennema-Nio wrote:
> So that leaves me two questions:
> 1. Maybe CREATE TYPE should allow types without input/output functions
> as long as send/receive are defined. For these types text
> representation could fall back to the hex representation of bytea.
> 2. If for some reason 1 is undesired, then why are transtypes so
> special. Why is it fine for them to only have send/receive functions
> and not for other types?

Basically: I agree with this argument, but I feel like this being a problem
for this use case is probably a sign that we should take the solution a step
further and solve this at the CREATE TYPE level, instead of allowing people
to hack around CREATE TYPE's limitations just for these partial aggregates.

> 3. Catalog size
> Regarding this point, I believe Approach1 is better than Approach2.
> In Approach1, theoretically, it is necessary to add export/import functions
> to pg_proc for each aggregate.
> In Approach2, theoretically, it is necessary to add typoutput/typinput
> functions (and typsend/typreceive if necessary) to pg_proc and add a native
> type to pg_type for each aggregate.
> I would like to emphasize that we should consider user-defined functions in
> addition to built-in aggregate functions.
> I think most developers prefer to avoid bloating catalogs, even if they may
> not be able to specify exact reasons.
> In fact, in Robert's previous review, he expressed a similar concern (see
> below).
>
> https://www.postgresql.org/message-id/CA%2BTgmobvja%2Bjytj5zcEcYgqzOaeJiqrrJxgqDf1q%3D3k8FepuWQ%40mail.gmail.com

So, to summarize the difference (assuming we change CREATE TYPE to allow only
typsend/typreceive): "Approach 2 adds an additional pg_type entry per
aggregate". IMHO this is fine, especially sin
Re: pgsql: Add pg_get_acl() to get the ACL for a database object
On Fri, Jul 05, 2024 at 10:40:39AM +0200, Joel Jacobson wrote:
> OK, I made an attempt to implement this, based on adapting code from
> recordExtObjInitPriv() in aclchk.c, which retrieves ACL based on ATTNUM.
>
> There are now two different code paths for columns and non-columns.
>
> It sounds like a bigger refactoring in this area would be nice,
> which would enable cleaning up code in other functions as well,
> but maybe that's better to do as a separate project.

Thanks for the patch. I have been looking at it for a few hours, eyeing a bit
at the ObjectProperty parts if we were to extend it for sub-object IDs, and
did not like the complexity this introduces, so I'd be OK to live with the
extra handling in pg_get_acl() itself.

+ /* ignore dropped columns */
+ if (atttup->attisdropped)
+ {
+ ReleaseSysCache(tup);
+ PG_RETURN_NULL();
+ }

Hmm. This is an important bit and I did not consider it at first. That makes
the use of ObjectProperty less flexible, because we want to look at the data
in the pg_attribute tuple to be able to return NULL in this case.

I've tweaked a bit what you are proposing, simplifying the code and removing
the first batch of queries in the tests as these were less interesting. How
does that look?
--
Michael

From 07b65a78fb660f6a27ef464870a3e27f788c5e28 Mon Sep 17 00:00:00 2001
From: Michael Paquier
Date: Mon, 8 Jul 2024 16:24:42 +0900
Subject: [PATCH v2] Extend pg_get_acl to handle sub-object IDs.

This patch modifies the pg_get_acl function to accept a third objsubid
param. This enables the retrieval of ACLs for columns within a table.
---
 src/include/catalog/pg_proc.dat          |  2 +-
 src/backend/catalog/objectaddress.c      | 60 +++-
 src/test/regress/expected/privileges.out | 32 +++--
 src/test/regress/sql/privileges.sql      | 12 +++--
 doc/src/sgml/func.sgml                   |  6 +--
 5 files changed, 87 insertions(+), 25 deletions(-)

diff --git a/src/include/catalog/pg_proc.dat b/src/include/catalog/pg_proc.dat
index 0bf413fe05..0a0f6aa198 100644
--- a/src/include/catalog/pg_proc.dat
+++ b/src/include/catalog/pg_proc.dat
@@ -6364,7 +6364,7 @@
 { oid => '8730', descr => 'get ACL for SQL object',
   proname => 'pg_get_acl', provolatile => 's', prorettype => '_aclitem',
-  proargtypes => 'oid oid', proargnames => '{classid,objid}',
+  proargtypes => 'oid oid int4', proargnames => '{classid,objid,objsubid}',
   prosrc => 'pg_get_acl' },
 { oid => '3839',
diff --git a/src/backend/catalog/objectaddress.c b/src/backend/catalog/objectaddress.c
index 2983b9180f..fac3922964 100644
--- a/src/backend/catalog/objectaddress.c
+++ b/src/backend/catalog/objectaddress.c
@@ -4364,19 +4364,19 @@ pg_identify_object_as_address(PG_FUNCTION_ARGS)
 
 /*
  * SQL-level callable function to obtain the ACL of a specified object, given
- * its catalog OID and object OID.
+ * its catalog OID, object OID and sub-object ID.
  */
 Datum
 pg_get_acl(PG_FUNCTION_ARGS)
 {
 	Oid			classId = PG_GETARG_OID(0);
 	Oid			objectId = PG_GETARG_OID(1);
+	int32		objsubid = PG_GETARG_INT32(2);
 	Oid			catalogId;
 	AttrNumber	Anum_acl;
-	Relation	rel;
-	HeapTuple	tup;
 	Datum		datum;
 	bool		isnull;
+	HeapTuple	tup;
 
 	/* for "pinned" items in pg_depend, return null */
 	if (!OidIsValid(classId) && !OidIsValid(objectId))
@@ -4391,18 +4391,52 @@ pg_get_acl(PG_FUNCTION_ARGS)
 	if (Anum_acl == InvalidAttrNumber)
 		PG_RETURN_NULL();
 
-	rel = table_open(catalogId, AccessShareLock);
-
-	tup = get_catalog_object_by_oid(rel, get_object_attnum_oid(catalogId),
-									objectId);
-	if (!HeapTupleIsValid(tup))
+	/*
+	 * If dealing with a relation's attribute (objsubid is set), the ACL is
+	 * retrieved from pg_attribute.
+	 */
+	if (classId == RelationRelationId && objsubid != 0)
 	{
-		table_close(rel, AccessShareLock);
-		PG_RETURN_NULL();
-	}
+		AttrNumber	attnum = (AttrNumber) objsubid;
+		Form_pg_attribute atttup;
 
-	datum = heap_getattr(tup, Anum_acl, RelationGetDescr(rel), &isnull);
-	table_close(rel, AccessShareLock);
+		tup = SearchSysCache2(ATTNUM, ObjectIdGetDatum(objectId),
+							  Int16GetDatum(attnum));
+
+		if (!HeapTupleIsValid(tup))
+			PG_RETURN_NULL();
+
+		atttup = (Form_pg_attribute) GETSTRUCT(tup);
+
+		/* ignore dropped columns */
+		if (atttup->attisdropped)
+		{
+			ReleaseSysCache(tup);
+			PG_RETURN_NULL();
+		}
+
+		datum = SysCacheGetAttr(ATTNUM, tup,
+								Anum_pg_attribute_attacl,
+								&isnull);
+		ReleaseSysCache(tup);
+	}
+	else
+	{
+		Relation	rel;
+
+		rel = table_open(catalogId, AccessShareLock);
+
+		tup = get_catalog_object_by_oid(rel, get_object_attnum_oid(catalogId),
+										objectId);
+		if (!HeapTupleIsValid(tup))
+		{
+			table_close(rel, AccessShareLock);
+			PG_RETURN_NULL();
+		}
+
+		datum = heap_getattr(tup, Anum_acl, RelationGetDescr(rel), &isnull);
+		table_close(rel, AccessShareLock);
+	}
 
 	if (isnull)
 		PG_RETURN_NULL();
diff --git a/src/test/regress/expected/privileges.ou
Re: Partial aggregates pushdown
On Sun, 30 Jun 2024 at 23:42, fujii.y...@df.mitsubishielectric.co.jp wrote: > Instead, I can make PARTIAL_AGGREGATE an unreserved word by placing it after > the FILTER clause, like avg(c1) FILTER (WHERE c2 > 0) PARTIAL_AGGREGATE, and > by marking it as an ASLABEL word like FILTER. > I attached the patch of the method. > If there are no objections, I would like to proceed with the method described > above. > I'd appreciate it if anyone comment the method. I like this approach of using PARTIAL_AGGREGATE in the same place as the FILTER clause.
Re: Partial aggregates pushdown
On Sun, 7 Jul 2024 at 23:52, fujii.y...@df.mitsubishielectric.co.jp wrote: > My plan for advancing the patch involves the following steps: > Step1. Decide the approach on transmitting state value. > Step2. Implement code (including comments) and tests to support a subset of > aggregate functions. > Specifically, I plan to support avg, sum, and other aggregate functions > like min and max which don't need export/import functions. > Step3. Add documentations. > > To clarify my approach, should I proceed with Step 3 before Step2? (my opinion, Bruce might have a different one) I think it's good that you split the original patch in two: 0001: non-internal partial aggregates 0002: internal partial aggregates I think we're aligned on the general design of 0001. So I think now is definitely the time to include documentation there, so we can discuss this patch in more detail, and move it forward. I think generally for 0002 it would also be useful to have documentation, I personally like reading it to understand the general design and then comparing that to the code. But I also understand that the language differences between Japanese and English, makes writing such docs a significant effort for you. So I think it would be fine to skip docs for 0002 for now until we decide on the approach we want to take for internal partial aggregates.
Re: Internal error codes triggered by tests
Hello Michael,

05.07.2024 03:57, Michael Paquier wrote:
> On Thu, Jul 04, 2024 at 11:00:01AM +0300, Alexander Lakhin wrote:
>> Could you please share your thoughts regarding other error cases, which are
>> not triggered by existing tests, but still can be easily reached by users?
>>
>> For example:
>> SELECT satisfies_hash_partition(1, 1, 0, 0);
>>
>> ERROR:  XX000: could not open relation with OID 1
>> LOCATION:  relation_open, relation.c:61
>>
>> or:
>> CREATE TABLE t (b bytea);
>> INSERT INTO t SELECT ''::bytea;
>> CREATE INDEX brinidx ON t USING brin (b bytea_bloom_ops(n_distinct_per_range = -1.0));
>>
>> ERROR:  XX000: the bloom filter is too large (44629 > 8144)
>> LOCATION:  bloom_init, brin_bloom.c:344
>>
>> Should such cases be corrected too?
>
> This is a case-by-case. satisfies_hash_partition() is undocumented, so
> doing nothing is fine by me. The second one, though, is something that
> can be triggered with rather normal DDL sequences. That's more annoying.

Thank you for the answer!

Let me show you other error types for discussion/classification:

SELECT pg_describe_object(1::regclass, 0, 0);
ERROR:  XX000: unsupported object class: 1
LOCATION:  getObjectDescription, objectaddress.c:4016

or
SELECT pg_identify_object_as_address('1'::regclass, 1, 1);
ERROR:  XX000: unsupported object class: 1
LOCATION:  getObjectTypeDescription, objectaddress.c:4597
--
SELECT format('BEGIN TRANSACTION ISOLATION LEVEL SERIALIZABLE;
SET TRANSACTION SNAPSHOT ''%s''', repeat('-', 1000)) \gexec
ERROR:  XX000: could not open file "pg_snapshots/-...---" for reading: File name too long
LOCATION:  ImportSnapshot, snapmgr.c:1428
--
CREATE OPERATOR === (leftarg = int4, rightarg = int4, procedure = int4eq, commutator = ===, hashes);
CREATE TABLE t1 (a int);
ANALYZE t1;
CREATE TABLE t2 (a int);
SELECT * FROM t1, t2 WHERE t1.a === t2.a;
ERROR:  XX000: could not find hash function for hash operator 16385
LOCATION:  ExecHashTableCreate, nodeHash.c:560
--
WITH RECURSIVE oq(x) AS (
 WITH iq as (SELECT * FROM oq) SELECT * FROM iq
 UNION
 SELECT * from iq
)
SELECT * FROM oq;
ERROR:  XX000: missing recursive reference
LOCATION:  checkWellFormedRecursion, parse_cte.c:896
(commented as "should not happen", but still...)
--
CREATE EXTENSION ltree;
SELECT '1' ::ltree @ (repeat('!', 100)) ::ltxtquery;
ERROR:  XX000: stack too short
LOCATION:  makepol, ltxtquery_io.c:252
--
There are also a couple of dubious ereport(DEBUG1, (errcode(ERRCODE_INTERNAL_ERROR), ...) calls like:

    /*
     * User-defined picksplit failed to create an actual split, ie it put
     * everything on the same side. Complain but cope.
     */
    ereport(DEBUG1,
            (errcode(ERRCODE_INTERNAL_ERROR),
             errmsg("picksplit method for column %d of index \"%s\" failed",
                    attno + 1, RelationGetRelationName(r)),

I'm not mentioning errors that require more analysis and maybe correcting the
surrounding logic, not the ERRCODE only.

Maybe it makes sense to separate the discussion of such errors, which are not
triggered by tests / not covered; I'm just not sure how to handle them
efficiently.

Best regards,
Alexander
Re: [PATCH] Add min/max aggregate functions to BYTEA
Hi, > What part of commitfest should I put the current patch to: "SQL > Commands", "Miscellaneous" or something else? I can't figure it out. Personally I qualified a similar patch [1] as "Server Features", although I'm not 100% sure if this was the best choice. [1]: https://commitfest.postgresql.org/48/4905/ -- Best regards, Aleksander Alekseev
Re: Removing unneeded self joins
On Thu, Jul 4, 2024 at 11:40 AM jian he wrote:
> On Thu, Jul 4, 2024 at 11:04 AM Alexander Korotkov wrote:
> >
> > On Thu, Jul 4, 2024 at 5:15 AM jian he wrote:
> > > in remove_self_join_rel, i have
> > > ```ChangeVarNodes((Node *) root->parse, toRemove->relid, toKeep->relid, 0);```
> > > which will change the joinlist(RangeTblRef) from (1,2) to (2,2).
> > > Immediately after this call, I wrote a function (restore_rangetblref)
> > > to restore the joinlist as original (1,2).
> > > then remove_rel_from_joinlist won't error out.
> > > see remove_self_join_rel, restore_rangetblref.
> >
> > Thank you, now this is clear. Could we add additional parameters to
> > ChangeVarNodes() instead of adding a new function which reverts part
> > of changes.
> >
>
> I didn't dare to. we have 42 occurrences of ChangeVarNodes.
> adding a parameter to it only for one location seems not intuitive.
>
> Now I have tried. changing to
> `ChangeVarNodes(Node *node, int rt_index, int new_index, int sublevels_up, bool change_RangeTblRef)`
>
> /* Replace varno in all the query structures */
> ChangeVarNodes((Node *) root->parse, toRemove->relid, toKeep->relid, 0, false);
> ```
>
> it seems to work, pass the regression test.
> ```ChangeVarNodes((Node *) root->parse, toRemove->relid, toKeep->relid, 0, false);```
> is in remove_self_join_rel, remove_self_joins_one_group, remove_self_joins_recurse.
> all other places are ```ChangeVarNodes((Node *) root->parse, toRemove->relid, toKeep->relid, 0, true);```
> so ChangeVarNodes add a parameter will only influence the SJE feature.

Good. But I think it's not necessary to replace the function signature in all
42 occurrences. This will make our patch unnecessarily conflict with others.
Instead we can have two functions: ChangeVarNodes() (the original function
signature) and ChangeVarNodesExtended() (the extended function signature).
Then existing occurrences can still use ChangeVarNodes(), which will be just
a shortcut for ChangeVarNodesExtended().

--
Regards,
Alexander Korotkov
Supabase
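A minimal sketch of that wrapper shape (hypothetical signatures and stubbed
bodies, not committed code) could look like this:

#include <stdbool.h>
#include <stddef.h>

typedef struct Node Node;		/* stand-in for the planner's Node type */

/* Extended variant carries the new flag (hypothetical signature). */
static void
ChangeVarNodesExtended(Node *node, int rt_index, int new_index,
					   int sublevels_up, bool change_RangeTblRef)
{
	(void) node;
	(void) rt_index;
	(void) new_index;
	(void) sublevels_up;
	(void) change_RangeTblRef;
	/* ...walk the tree, skipping RangeTblRef nodes when the flag is false... */
}

/* The existing signature stays a thin shortcut, so current call sites are untouched. */
static void
ChangeVarNodes(Node *node, int rt_index, int new_index, int sublevels_up)
{
	ChangeVarNodesExtended(node, rt_index, new_index, sublevels_up, true);
}

int
main(void)
{
	ChangeVarNodes(NULL, 1, 2, 0);				/* old-style call keeps working */
	ChangeVarNodesExtended(NULL, 1, 2, 0, false);	/* SJE-style call */
	return 0;
}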
Re: Adminpack removal
On 01/07/2024 at 10:07, Daniel Gustafsson wrote:
>> On 28 Jun 2024, at 09:06, Philippe BEAUDOIN wrote:
>> So just looking in public repos covers probably less than 1% of the code.
>> However, this may give a first idea, especially if a feature use is
>> already detected.
>
> Searching for anything on GitHub is essentially a dead end since it reports
> so many duplicates in forks etc. That being said, I did a lot of searching
> and browsing to find users [0], but came up empty (apart from forks which
> already maintain their own copy). A more targeted search is the Debian Code
> Search, which at the time of removal (and well before then) showed zero
> occurrences of adminpack functions in any packaged software, and no
> extensions which had adminpack as a dependency. While not an exhaustive
> search by any means, it does provide a good hint.
>
> Since you list no other extensions using adminpack to support keeping it, I
> assume you also didn't find any when searching?

I just said that there is much more code in private repos (so not analyzable)
than in the public ones.

> --
> Daniel Gustafsson
>
> [0] https://www.postgresql.org/message-id/B07CC211-DE35-4AC5-BD4E-0C6466700B06%40yesql.se
Re: MIN/MAX functions for a record
Hi, > Many thanks. Here is the corrected patch. Now it also includes MIN() > support and tests. Michael Paquier (cc:'ed) commented offlist that I forgot to change the documentation. Here is the corrected patch. -- Best regards, Aleksander Alekseev v3-0001-Support-min-record-and-max-record-aggregates.patch Description: Binary data
Re: XID formatting and SLRU refactorings (was: Add 64-bit XIDs into PostgreSQL 15)
Hi, > Here is the corrected patchset. TWIMC this is currently listed as an open item for PG17 [1]. Sorry if everyone interested is already aware. [1]: https://wiki.postgresql.org/wiki/PostgreSQL_17_Open_Items -- Best regards, Aleksander Alekseev
Re: Use pgBufferUsage for block reporting in analyze
Hi,

Thanks for the review, I've updated the patches with the suggestions:
- moved renaming of misses to reads to the first patch
- added intermediate variables for total blks usage

I've also done some additional tests using the provided
vacuum_analyze_buffer_usage.sql script. It relies on pg_stat_statements to
check the results (only pgss gives information on dirtied buffers). It gives
the following output:

psql:vacuum_analyze_buffer_usage.sql:21: INFO:  vacuuming "postgres.pg_temp_7.vacuum_blks_stat_test"
...
buffer usage: 105 hits, 3 reads, 6 dirtied
...
       query        | sum_hit | sum_read | sum_dirtied
--------------------+---------+----------+-------------
 VACUUM (VERBOSE... |     105 |        3 |           6

For vacuum, we have the same results with SKIP_DATABASE_STATS. Without this
setting, we would have block usage generated by vac_update_datfrozenxid
outside of vacuum_rel and therefore not tracked by the verbose output.

For the second test, the second patch is needed to have ANALYZE (VERBOSE)
output the block usage. It will output the following:

psql:vacuum_analyze_buffer_usage.sql:29: INFO:  analyzing "pg_temp_7.vacuum_blks_stat_test"
...
buffer usage: 84 hits, 33 reads, 2 dirtied
...
        query        | sum_hit | sum_read | sum_dirtied
---------------------+---------+----------+-------------
 ANALYZE (VERBOSE... |      91 |       38 |           2

There are additional buffer hits/reads reported by pgss; those are from
analyze_rel opening the relations in try_relation_open and are not tracked by
the ANALYZE VERBOSE output.

v3-0001-Use-pgBufferUsage-for-block-reporting-in-analyze.patch
Description: Binary data

v3-0002-Output-buffer-and-wal-usage-with-verbose-analyze.patch
Description: Binary data

vacuum_analyze_buffer_usage.sql
Description: Binary data
Re: Interrupts vs signals
On 08/07/2024 05:56, Thomas Munro wrote:
> Here's an updated version of this patch. The main idea is that
> SendProcSignal(pid, PROCSIGNAL_XXX, procno) becomes
> SendInterrupt(INTERRUPT_XXX, procno), and all the pending interrupt global
> variables and pss_procsignalFlags[] go away, along with the SIGUSR1
> handler. The interrupts are compressed into a single bitmap. See commit
> message for more details. The patch is failing on Windows CI for reasons I
> haven't debugged yet, but I wanted to share what I have so far. Work in
> progress!
>
> Here is my attempt to survey the use of signals and write down what I think
> we might do about them all so far, to give the context for this patch:
> https://wiki.postgresql.org/wiki/Signals
>
> Comments, corrections, edits very welcome.

Nice, thanks!

> Background worker state notifications are also changed from raw
> kill(SIGUSR1) to SetLatch(). That means that SetLatch() is now called from
> the postmaster. The main purpose of including that change is to be able to
> remove procsignal_sigusr1_handler completely and set SIGUSR1 to SIG_IGN,
> and show the system working.
>
> XXX Do we need to invent SetLatchRobust() that doesn't trust anything in
> shared memory, to be able to set latches from the postmaster?

The patch actually does both: it still does kill(SIGUSR1) and also sets the
latch.

I think it would be nice if RegisterDynamicBackgroundWorker() had a "bool
notify_me" argument, instead of requiring the caller to set "bgw_notify_pid =
MyProcPid" before the call. That's a backwards-compatibility break, but maybe
we should bite the bullet and do it. Or we could do this in
RegisterDynamicBackgroundWorker():

if (worker->bgw_notify_pid == MyProcPid)
    worker->bgw_notify_pgprocno = MyProcNumber;

I think we can forbid setting bgw_notify_pid to anything else than 0 or
MyProcPid.

A SetLatchRobust would be nice. Looking at SetLatch(), I don't think it can
do any damage if you called it on a pointer to garbage, except if the pointer
itself is bogus, then just dereferencing it can cause a segfault. So it would
be nice to have a version specifically designed with that in mind. For
example, it could assume that the Latch's pid is never legally equal to
MyProcPid, because postmaster cannot own any latches.

Another approach would be to move the responsibility of background worker
state notifications out of postmaster completely. When a new background
worker is launched, the worker process itself could send the notification
that it has started. And similarly, when a worker exits, it could send the
notification just before exiting. There's a little race condition with
exiting: if a process is waiting for the bgworker to exit, and launches a new
worker immediately when the old one exits, there will be a brief period when
the old and new process are alive at the same time. The old worker wouldn't
be doing anything interesting anymore since it's exiting, but it still counts
towards max_worker_processes, so launching the new process might fail because
of hitting the limit. Maybe we should just bump up max_worker_processes. Or
postmaster could check PMChildFlags and not count processes that have already
deregistered from PMChildFlags towards the limit.

-volatile uint32 InterruptHoldoffCount = 0;
-volatile uint32 QueryCancelHoldoffCount = 0;
-volatile uint32 CritSectionCount = 0;
+uint32 InterruptHoldoffCount = 0;
+uint32 QueryCancelHoldoffCount = 0;
+uint32 CritSectionCount = 0;

I wondered if these are used in PG_TRY-CATCH blocks in a way that would still
require volatile. I couldn't find any such usage by some quick grepping, so I
think we're good, but I thought I'd mention it.

+/*
+ * The set of "standard" interrupts that CHECK_FOR_INTERRUPTS() and
+ * ProcessInterrupts() handle. These perform work that is safe to run whenever
+ * interrupts are not "held". Other kinds of interrupts are only handled at
+ * more restricted times.
+ */
+#define INTERRUPT_STANDARD_MASK \

Some interrupts are missing from this mask:

- INTERRUPT_PARALLEL_APPLY_MESSAGE
- INTERRUPT_IDLE_STATS_UPDATE_TIMEOUT
- INTERRUPT_SINVAL_CATCHUP
- INTERRUPT_NOTIFY

Is that on purpose?

-/*
- * Because backends sitting idle will not be reading sinval events, we
- * need a way to give an idle backend a swift kick in the rear and make
- * it catch up before the sinval queue overflows and forces it to go
- * through a cache reset exercise. This is done by sending
- * PROCSIG_CATCHUP_INTERRUPT to any backend that gets too far behind.
- *
- * The signal handler will set an interrupt pending flag and will set the
- * processes latch. Whenever starting to read from the client, or when
- * interrupted while doing so, ProcessClientReadInterrupt() will call
- * ProcessCatchupEvent().
- */
-volatile sig_atomic_t catchupInterruptPending = false;

Would be nice to move that comment somewhere else rather than remove it
completely.

--- a/src/backe
RE: Conflict Detection and Resolution
On Monday, July 8, 2024 12:32 PM Zhijie Hou (Fujitsu) wrote:
>
> I researched how to detect and resolve update_deleted and thought about one
> idea: which is to maintain the xmin in the logical slot to preserve the dead
> row and support latest_timestamp_xmin resolution for update_deleted to
> maintain data consistency.
>
> Here are details of the xmin idea and resolution of update_deleted:
>
> 1. how to preserve the dead row so that we can detect update_delete
>    conflict correctly. (In the following explanation, let's assume there is
>    a multi-master setup with nodes A and B.)
>
> To preserve the dead row on node A, I think we could maintain the "xmin"
> in the logical replication slot on Node A to prevent VACUUM from removing
> the dead row in the user table. The walsender that acquires the slot is
> responsible for advancing the xmin. (Note that I am trying to explore the
> xmin idea as it could be more efficient than using commit_timestamp, and
> the logic could be simpler as we are already maintaining catalog_xmin in
> the logical slot and xmin in the physical slot.)
>
> - Strategy for advancing xmin:
>
> The xmin can be advanced if a) a transaction (xid:1000) has been flushed
> to the remote node (Node B in this case). *AND* b) On Node B, the local
> transactions that happened before applying the remote transaction (xid:1000)
> were also sent and flushed to Node A.
>
> - The implementation:
>
> Condition a) can be achieved with existing code; the walsender can advance
> the xmin similarly to the catalog_xmin.
>
> For condition b), we can add a subscription option (say 'feedback_slot').
> The feedback_slot indicates the replication slot that will send changes to
> the origin (on Node B, the slot should be subBA). The apply worker will
> check the status (confirmed flush LSN) of the 'feedback slot' and send
> feedback to the walsender about the WAL position that has been sent and
> flushed via the feedback_slot.

The above are some initial thoughts on how to preserve the dead row for
update_deleted conflict detection.

After thinking more, I have identified a few additional cases that I missed
to analyze regarding the design. One aspect that needs more thought is the
possibility of multiple slots on each node. In this scenario, the
'feedback_slot' subscription option would need to be structured as a list.
However, requiring users to specify all the slots may not be user-friendly. I
will explore whether this process can be automated.

In addition, I will think more about the potential impact of re-using the
existing 'xmin' of the slot, which may affect existing logic that relies on
'xmin'. I will analyze more and reply about these points.

Best Regards,
Hou zj
Re: Incorrect results from numeric round() and trunc()
On Mon, 8 Jul 2024 at 00:40, Joel Jacobson wrote:
>
> On Sun, Jul 7, 2024, at 13:28, Dean Rasheed wrote:
> > I've also tidied up a bit by replacing all instances of SHRT_MAX with
> > a new constant NUMERIC_WEIGHT_MAX, whose name more accurately
> > describes the limit, as used in various other overflow checks.
>
> Having thought a bit more on this, I think we probably need a
> DEC_DIGITS-sensitive definition of NUMERIC_WEIGHT_MAX,
> since per spec the max range for numeric is 0x20000 (131072)
> decimal digits.
>

No, the maximum weight is determined by the use of int16 to store the
weight. Therefore, if you did reduce DEC_DIGITS to 1 or 2, the number
of decimal digits allowed before the decimal point would be reduced too.

Regards,
Dean
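To make the relationship concrete, here is a small, self-contained
back-of-the-envelope program (an illustration of the reasoning above, not
code from the patch): the weight is an int16 counting NBASE digits, so the
number of decimal digits available before the decimal point scales with
DEC_DIGITS.

#include <stdio.h>

int
main(void)
{
	const int	weight_max = 0x7FFF;	/* int16 limit, i.e. NUMERIC_WEIGHT_MAX */
	const int	dec_digits_options[] = {1, 2, 4};

	for (int i = 0; i < 3; i++)
	{
		int		dec_digits = dec_digits_options[i];
		long	max_int_digits = (long) (weight_max + 1) * dec_digits;

		/* For DEC_DIGITS = 4 this prints 131072, matching the figure above. */
		printf("DEC_DIGITS=%d -> up to ~%ld decimal digits before the point\n",
			   dec_digits, max_int_digits);
	}
	return 0;
}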
Re: Parallel CREATE INDEX for GIN indexes
On Sun, 7 Jul 2024, 18:26 Tomas Vondra, wrote:
>
> On 7/5/24 21:45, Matthias van de Meent wrote:
>> On Wed, 3 Jul 2024 at 20:36, Matthias van de Meent wrote:
>>>
>>> On Mon, 24 Jun 2024 at 02:58, Tomas Vondra wrote:
>>>> So I've switched this to use the regular data-type comparisons, with
>>>> SortSupport etc. There's a bit more cleanup remaining and testing
>>>> needed, but I'm not aware of any bugs.
>>>
>>> ---
>>> +++ b/src/backend/utils/sort/tuplesortvariants.c
>>>
>>> I was thinking some more about merging tuples inside the tuplesort. I
>>> realized that this could be implemented by allowing buffering of tuple
>>> writes in writetup. This would require adding a flush operation at the
>>> end of mergeonerun to store the final unflushed tuple on the tape, but
>>> that shouldn't be too expensive. This buffering, when implemented
>>> through e.g. a GinBuffer in TuplesortPublic->arg, could allow us to
>>> merge the TID lists of same-valued GIN tuples while they're getting
>>> stored and re-sorted, thus reducing the temporary space usage of the
>>> tuplesort by some amount with limited overhead for other
>>> non-deduplicating tuplesorts.
>>>
>>> I've not yet spent the time to get this to work though, but I'm fairly
>>> sure it'd use less temporary space than the current approach with the
>>> 2 tuplesorts, and could have lower overall CPU overhead as well
>>> because the number of sortable items gets reduced much earlier in the
>>> process.
>>
>> I've now spent some time on this. Attached the original patchset, plus
>> 2 incremental patches, the first of which implements the above design
>> (patch no. 8).
>>
>> Local tests show it's significantly faster: for the below test case
>> I've seen reindex time reduced from 777455ms to 626217ms, or ~20%
>> improvement.
>>
>> After applying the 'reduce the size of GinTuple' patch, index creation
>> time is down to 551514ms, or about 29% faster total. This all was
>> tested with a fresh stock postgres configuration.
>>
>> """
>> CREATE UNLOGGED TABLE testdata
>> AS SELECT sha256(i::text::bytea)::text
>> FROM generate_series(1, 1500) i;
>> CREATE INDEX ON testdata USING gin (sha256 gin_trgm_ops);
>> """
>>
>
> Those results look really promising. I certainly would not have expected
> such improvements - 20-30% speedup on top of the already parallel run
> seems great. I'll do a bit more testing to see how much this helps for
> the "more realistic" data sets.
>
> I can't say I 100% understand how the new stuff in tuplesortvariants.c
> works, but that's mostly because my knowledge of tuplesort internals is
> fairly limited (and I managed to survive without that until now).
>
> What makes me a bit unsure/uneasy is that this seems to "inject" custom
> code fairly deep into the tuplesort logic. I'm not quite sure if this is
> a good idea ...

I thought it was still fairly high-level: it adds (what seems to me) a
natural extension to the pre-existing "write a tuple to the tape" API,
allowing the Tuplesort (TS) implementation to further optimize its ordered
tape writes through buffering (and combining) of tuple writes. While it does
remove the current 1:1 relation of TS tape writes to tape reads for the GIN
case, there is AFAIK no code in TS that relies on that 1:1 relation.

As to the GIN TS code itself; yes, it's more complicated, mainly because it
uses several optimizations to reduce unnecessary allocations and
(de)serializations of GinTuples, and I'm aware of even more such
optimizations that can be added at some point.

As an example: I suspect the use of GinBuffer in writetup_index_gin to be a
significant resource drain in my patch because it lacks "freezing" in the
tuplesort buffer. When no duplicate key values are encountered, the code is
nearly optimal (except for a full tuple copy to get the data into the
GinBuffer's memory context), but if more than one GinTuple has the same key
in the merge phase we deserialize both tuples' posting lists and merge the
two. I suspect that merge to be more expensive than operating on the
compressed posting lists of the GinTuples themselves, so that's something I
think could be improved. I suspect/guess it could save another 10% in select
cases, and it will definitely reduce the memory footprint of the buffer.

Another thing that can be optimized is the current approach of inserting
data into the index: I think it's kind of wasteful to decompress and later
re-compress the posting lists once we start storing the tuples on disk.

Kind regards,

Matthias van de Meent
Neon (https://neon.tech)
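For readers less familiar with this area, the merge step mentioned above
boils down to combining two sorted TID lists into one. A minimal,
self-contained C sketch of that operation (illustrative only -- the real
code operates on GinTuple posting lists, possibly compressed) is:

#include <stdint.h>
#include <stdio.h>

/* Merge two ascending arrays of "TIDs" (plain uint64 keys here) into dst,
 * dropping duplicates.  Returns the number of entries written. */
static int
merge_tid_lists(const uint64_t *a, int na, const uint64_t *b, int nb,
				uint64_t *dst)
{
	int		ia = 0, ib = 0, n = 0;

	while (ia < na || ib < nb)
	{
		if (ib >= nb || (ia < na && a[ia] < b[ib]))
			dst[n++] = a[ia++];
		else if (ia >= na || b[ib] < a[ia])
			dst[n++] = b[ib++];
		else
		{
			dst[n++] = a[ia++];	/* equal: keep one copy */
			ib++;
		}
	}
	return n;
}

int
main(void)
{
	uint64_t	a[] = {1, 4, 7};
	uint64_t	b[] = {2, 4, 9, 11};
	uint64_t	out[7];
	int			n = merge_tid_lists(a, 3, b, 4, out);

	for (int i = 0; i < n; i++)
		printf("%llu ", (unsigned long long) out[i]);
	printf("\n");				/* prints: 1 2 4 7 9 11 */
	return 0;
}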
Re: Improving the latch handling between logical replication launcher and worker processes.
On Fri, Jul 5, 2024 at 6:38 PM Heikki Linnakangas wrote:
>
> Your solution with an additional latch seems better because it makes
> WaitForReplicationWorkerAttach() react more quickly, without the 10 s
> wait. I'm surprised we have that in the first place, 10 s seems like a
> pretty long time to wait for a parallel apply worker to start. Why was
> that ever OK?
>

Doesn't that call wait for 10 milliseconds? The comment atop WaitLatch ("The
"timeout" is given in milliseconds...") indicates the timeout is in
milliseconds.

--
With Regards,
Amit Kapila.
Re: pgsql: Add pg_get_acl() to get the ACL for a database object
On Mon, Jul 8, 2024, at 10:34, Michael Paquier wrote: > Thanks for the patch. I have been looking at it for a few hours, > eyeing a bit on the ObjectProperty parts a bit if we were to extend it > for sub-object IDs, and did not like the complexity this introduces, > so I'd be OK to live with the extra handling in pg_get_acl() itself. > > + /* ignore dropped columns */ > + if (atttup->attisdropped) > + { > + ReleaseSysCache(tup); > + PG_RETURN_NULL(); > + } > > Hmm. This is an important bit and did not consider it first. That > makes the use of ObjectProperty less flexible because we want to look > at the data in the pg_attribute tuple to be able to return NULL in > this case. > > I've tweaked a bit what you are proposing, simplifying the code and > removing the first batch of queries in the tests as these were less > interesting. How does that look? Thanks, nice simplifications. I agree the tests you removed are not that interesting. Looks good to me. Patch didn't apply to HEAD nor on top of any of the previous commits either, but I couldn't figure out why based on the .rej files, strange. I copy/pasted the parts from the patch to test it. Let me know if you need a rebased version of it, in case you will need to do the same, to save some work. Also noted the below in your last commit: a6417078c414 has introduced as project policy that new features committed during the development cycle should use new OIDs in the [8000,] range. 4564f1cebd43 did not respect that rule, so let's renumber pg_get_acl() to use an OID in the correct range. Thanks for the info! Will make sure to use the `src/include/catalog/renumber_oids.pl` tool for future patches. Regards, Joel
Re: Incorrect results from numeric round() and trunc()
On Mon, Jul 8, 2024, at 11:45, Dean Rasheed wrote:
> On Mon, 8 Jul 2024 at 00:40, Joel Jacobson wrote:
>>
>> On Sun, Jul 7, 2024, at 13:28, Dean Rasheed wrote:
>> > I've also tidied up a bit by replacing all instances of SHRT_MAX with
>> > a new constant NUMERIC_WEIGHT_MAX, whose name more accurately
>> > describes the limit, as used in various other overflow checks.
>>
>> Having thought a bit more on this, I think we probably need a
>> DEC_DIGITS-sensitive definition of NUMERIC_WEIGHT_MAX,
>> since per spec the max range for numeric is 0x20000 (131072)
>> decimal digits.
>>
>
> No, the maximum weight is determined by the use of int16 to store the
> weight. Therefore if you did reduce DEC_DIGITS to 1 or 2, the number
> of decimal digits allowed before the decimal point would be reduced
> too.

OK, that can actually be seen as a feature, especially since it's of course
more likely that DEC_DIGITS could increase in the future than decrease.

For example, let's say we would double it to 8; then, if NUMERIC_WEIGHT_MAX
were still 0x7FFF (32767), the maximum range for numeric would increase from
131072 to 262144 decimal digits allowed before the decimal point.

LGTM.

Regards,
Joel
Re: Address the -Wuse-after-free warning in ATExecAttachPartition()
On Mon, Jul 8, 2024 at 3:22 PM Nitin Jadhav wrote: > > In [1], Andres reported a -Wuse-after-free bug in the > ATExecAttachPartition() function. I've created a patch to address it > with pointers from Amit offlist. > > The issue was that the partBoundConstraint variable was utilized after > the list_concat() function. This could potentially lead to accessing > the partBoundConstraint variable after its memory has been freed. > > The issue was resolved by using the return value of the list_concat() > function, instead of using the list1 argument of list_concat(). I > copied the partBoundConstraint variable to a new variable named > partConstraint and used it for the previous references before invoking > get_proposed_default_constraint(). I confirmed that the > eval_const_expressions(), make_ands_explicit(), > map_partition_varattnos(), QueuePartitionConstraintValidation() > functions do not modify the memory location pointed to by the > partBoundConstraint variable. Therefore, it is safe to use it for the > next reference in get_proposed_default_constraint() > > Attaching the patch. Please review and share the comments if any. > Thanks to Andres for spotting the bug and some off-list advice on how > to reproduce it. The patch LGTM. Curious how to reproduce the bug ;) > > [1]: > https://www.postgresql.org/message-id/flat/202311151802.ngj2la66jwgi%40alvherre.pgsql#4fc5622772ba0244c1ad203f5fc56701 > > Best Regards, > Nitin Jadhav > Azure Database for PostgreSQL > Microsoft -- Regards Junwang Zhao
Re: Test to dump and restore objects left behind by regression
On Fri, Jul 5, 2024 at 10:59 AM Michael Paquier wrote: > On Fri, Jun 28, 2024 at 06:00:07PM +0530, Ashutosh Bapat wrote: > > Here's a description of patches and some notes > > 0001 > > --- > > 1. Per your suggestion the logic to handle dump output differences is > > externalized in PostgreSQL::Test::AdjustDump. Instead of eliminating > those > > differences altogether from both the dump outputs, the corresponding DDL > in > > the original dump output is adjusted to look like that from the restored > > database. Thus we retain full knowledge of what differences to expect. > > 2. I have changed the name filter_dump to filter_dump_for_upgrade so as > to > > differentiate between two adjustments 1. for upgrade and 2. for > > dump/restore. Ideally the name should have been > adjust_dump_for_ugprade() . > > It's more of an adjustment than filtering as indicated by the function it > > calls. But I haven't changed that. The new function to adjust dumps for > > dump and restore tests is named adjust_dump_for_restore() however. > > 3. As suggested by Daniel upthread, the test for dump and restore happens > > before upgrade which might change the old cluster thus changing the state > > of objects left behind by regression. The test is not executed if > > regression is not used to create the old cluster. > > 4. The code to compare two dumps and report differences if any is moved > to > > its own function compare_dumps() which is used for both upgrade and > > dump/restore tests. > > The test uses the custom dump format for dumping and restoring the > > database. > > At quick glance, that seems to be going in the right direction. Note > that you have forgotten install and uninstall rules for the new .pm > file. > Before submitting the patch, I looked for all the places which mention AdjustUpgrade or AdjustUpgrade.pm to find places where the new module needs to be mentioned. But I didn't find any. AdjustUpgrade is not mentioned in src/test/perl/Makefile or src/test/perl/meson.build. Do we want to also add AdjustUpgrade.pm in those files? > > 0002 increases more the runtime of a test that's already one of the > longest ones in the tree is not really appealing, I am afraid. > We could forget 0002. I am fine with that. But I can change the code such that formats other than "plain" are tested when PG_TEST_EXTRAS contains "regress_dump_formats". Would that be acceptable? -- Best Wishes, Ashutosh Bapat
Re: tests fail on windows with default git settings
Hi

On Sun, 7 Jul 2024 at 07:07, Andres Freund wrote:
> Hi,
>
> On 2024-07-07 01:26:13 -0400, Tom Lane wrote:
> > Andres Freund writes:
> > > Do we want to support checking out with core.autocrlf?
> >
> > -1. It would be a constant source of breakage, and you could never
> > expect that (for example) making a tarball from such a checkout
> > would match anyone else's results.
>
> WFM.
>
> > > If we do not want to support that, ISTM we ought to raise an error
> > > somewhere?
> >
> > +1, if we can figure out how.
>
> I can see two paths:
>
> 1) we prevent eol conversion, by using the right magic incantation in
>    .gitattributes
>
> 2) we check that some canary file is correctly encoded, e.g. during meson
>    configure (should suffice, this is realistically only a windows issue)
>
> It seems that the only realistic way to achieve 1) is to remove the "text"
> attribute from all files. That had me worried for a bit, thinking that might
> have a larger blast radius. However, it looks like this is solely used for
> line-ending conversion. The man page says:
> "This attribute marks the path as a text file, which enables end-of-line
> conversion:"
>
> Which sounds like it'd work well - except that it appears to behave oddly
> when updating to such a change in an existing repo -
>
> cd /tmp/;
> rm -rf pg-eol;
> git -c core.eol=crlf -c core.autocrlf=true clone ~/src/postgresql pg-eol;
> cd pg-eol;
> git config core.eol crlf; git config core.autocrlf true;
> stat src/test/modules/test_json_parser/tiny.json -> 6748 bytes
>
> cd ~/src/postgresql
> stat src/test/modules/test_json_parser/tiny.json -> 6604 bytes
> echo '* -text' >> .gitattributes
> git commit -a -m tmp
>
> cd /tmp/pg-eol
> git pull
> git st
> ...
> nothing to commit, working tree clean
> stat src/test/modules/test_json_parser/tiny.json -> 6748 bytes
>
> I.e. the repo still is in CRLF state.
>
> But if I reclone at that point, the line endings are in a sane state.
>
> IIUC this is because line-ending conversion is done only during
> checkout/checkin.
>
> There are ways to get git to redo the normalization, but it's somewhat
> awkward [1].

Yeah, I vaguely remember playing with core.autocrlf many years ago and
running into similar issues.

> OTOH, given that the tests already fail, I assume our windows contributors
> already have disabled autocrlf?

I can't speak for others of course, but at least as far as building of
installers is concerned, we use tarballs, not git checkouts.

For my own work; well, I've started playing with PG17 on Windows just in the
last month or so and have noticed a number of test failures, as well as a
weird meson issue that only shows up on a GitHub Actions runner. I was hoping
to look into those issues this week, as I've been somewhat sidetracked with
other work of late.

--
Dave Page
pgAdmin: https://www.pgadmin.org
PostgreSQL: https://www.postgresql.org
EDB: https://www.enterprisedb.com
array_in sub function ReadArrayDimensions error message
While reviewing the json query doc, I noticed that the following error message is not quite right. select '[1,2]'::int[]; ERROR: malformed array literal: "[1,2]" LINE 1: select '[1,2]'::int[]; ^ DETAIL: Missing "]" after array dimensions. Should the detail be: "Missing delimiter ":" while specifying array dimensions." ? Because here, the first problem is that the comma should be a colon. Also, shouldn't "DETAIL: Missing "]" after array dimensions." rather read "DETAIL: Missing "]" while specifying array dimensions." ?
Re: Thoughts on NBASE=100000000
On Sun, 7 Jul 2024, 22:40 Joel Jacobson, wrote: > > Hello hackers, > > I'm not hopeful this idea will be fruitful, but maybe we can find solutions > to the problems together. > > The idea is to increase the numeric NBASE from 1e4 to 1e8, which could > possibly > give a significant performance boost of all operations across the board, > on 64-bit architectures, for many inputs. > > Last time numeric's base was changed was back in 2003, when d72f6c75038 > changed > it from 10 to 10000. Back then, 32-bit architectures were still dominant, > so base-10000 was clearly the best choice at this time. > > Today, since 64-bit architectures are dominant, NBASE=1e8 seems like it would > have been the best choice, since the square of that still fits in > a 64-bit signed int. Back then 64-bit was by far not as dominant (server and consumer chips with AMD64 ISA only got released that year after the commit), so I don't think 1e8 would have been the best choice at that point in time. Would be better now, yes, but not back then. > Changing NBASE might seem impossible at first, due to the existing numeric > data > on disk, and incompatibility issues when numeric data is transferred on the > wire. > > Here are some ideas on how to work around some of these: > > - Incrementally changing the data on disk, e.g. upon UPDATE/INSERT > and supporting both NBASE=1e4 (16-bit) and NBASE=1e8 (32-bit) > when reading data. I think that a dynamic decision would make more sense here. At low precision, the overhead of 4+1 bytes vs 2 bytes is quite significant. This sounds important for overall storage concerns, especially if the padding bytes (mentioned below) are added to indicate types. > - Due to the lack of a version field in the NumericVar struct, > we need a way to detect if a Numeric value on disk uses > the existing NBASE=1e4, or NBASE=1e8. > One hack I've thought about is to exploit the fact that NUMERIC_NBYTES, > defined as: > #define NUMERIC_NBYTES(num) (VARSIZE(num) - NUMERIC_HEADER_SIZE(num)) > will always be divisible by two, since a NumericDigit is an int16 (2 bytes). > The idea is then to let "NUMERIC_NBYTES divisible by three" > indicate NBASE=1e8, at the cost of one to three extra padding bytes. Do you perhaps mean NUMERIC_NBYTES *not divisible by 2*, i.e. an uneven NUMERIC_NBYTES as indicator for NBASE=1e8, rather than only multiples of 3? I'm asking because there are many integers divisible by both 2 and 3 (all integer multiples of 6; that's 50% of the multiples of 3), so with the multiple-of-3 scheme we might need up to 5 pad bytes to get to the next multiple of 3 that isn't also a multiple of 2. Additionally, if the last digit would've fit in NBASE_1e4, then the 1e8-based numeric value could even be 7 bytes larger than the equivalent 1e4-based numeric. While I don't think this is worth implementing for general usage, it could be worth exploring for the larger numeric values, where the relative overhead of the larger representation is lower. Kind regards, Matthias van de Meent Neon (https://neon.tech)
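For illustration only, here is a minimal sketch of how the parity trick could be used when reading a value. It assumes the NUMERIC_NBYTES macro quoted above; the NUMERIC_IS_NBASE_1E8 macro and the numeric_digit_width helper are invented for this sketch, not taken from any patch:

#include "utils/numeric.h"

/*
 * Sketch: with NBASE=1e4 the digits are int16, so the data area is always an
 * even number of bytes.  An odd data length (forced by writing one pad byte)
 * would mark a value stored with NBASE=1e8 and int32 digits.
 */
#define NUMERIC_IS_NBASE_1E8(num) \
	((NUMERIC_NBYTES(num) & 1) != 0)

static int
numeric_digit_width(Numeric num)
{
	/* 4-byte digits for base 1e8, 2-byte digits for the existing base 1e4 */
	return NUMERIC_IS_NBASE_1E8(num) ? sizeof(int32) : sizeof(int16);
}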
Is it expected behavior index only scan shows "OPERATOR(pg_catalog." for EXPLAIN?
Hi, While I'm researching about [1], I found there are inconsistent EXPLAIN outputs. Here is an example which shows " OPERATOR(pg_catalog.". Though it's not wrong, I feel like there is no consistency in the output format. -- A reproduce procedure create temp table btree_bpchar (f1 text collate "C"); create index on btree_bpchar(f1 bpchar_ops); insert into btree_bpchar values ('foo'), ('fool'), ('bar'), ('quux'); set enable_seqscan to false; set enable_bitmapscan to false; set enable_indexonlyscan to false; -- or true explain (costs off) select * from btree_bpchar where f1::bpchar like 'foo'; -- Index Scan result QUERY PLAN -- Index Scan using btree_bpchar_f1_idx on btree_bpchar Index Cond: ((f1)::bpchar = 'foo'::bpchar) Filter: ((f1)::bpchar ~~ 'foo'::text) (3 rows) -- Index Only Scan result which has 'OPERATOR' QUERY PLAN --- Index Only Scan using btree_bpchar_f1_idx on btree_bpchar Index Cond: (f1 OPERATOR(pg_catalog.=) 'foo'::bpchar)-- Here is the point. Filter: ((f1)::bpchar ~~ 'foo'::text) (3 rows) IIUC, the index only scan use fixed_indexquals, which is removed "RelabelType" nodes, for EXPLAIN so that get_rule_expr() could not understand the left argument of the operator (f1 if the above case) can be displayed with arg::resulttype and it doesn't need to show "OPERATOR(pg_catalog.)". I've attached PoC patch to show a simple solution. It just adds a new member "indexqualorig" to the index only scan node like the index scan and the bitmap index scan. But, since I'm a beginner about the planner, I might have misunderstood something or there should be better ways. BTW, I'm trying to add a new index AM interface for EXPLAIN on the thread([1]). As the aspect, my above solution might not be ideal because AMs can only know index column ids (varattno) from fixed_indexquals. In that case, to support "fixed_indexquals" as argument of deparse_expression() is better. [1] Improve EXPLAIN output for multicolumn B-Tree Index https://www.postgresql.org/message-id/flat/TYWPR01MB1098260B694D27758FE2BA46FB1C92%40TYWPR01MB10982.jpnprd01.prod.outlook.com Regards, -- Masahiro Ikeda NTT DATA CORPORATION v1-0001-Fix-inconsistent-explain-output-for-index-only-sc.patch Description: v1-0001-Fix-inconsistent-explain-output-for-index-only-sc.patch
Re: Is it expected behavior index only scan shows "OPERATOR(pg_catalog." for EXPLAIN?
On 7/8/24 13:03, masahiro.ik...@nttdata.com wrote: > Hi, > > While I'm researching about [1], I found there are inconsistent EXPLAIN > outputs. > Here is an example which shows " OPERATOR(pg_catalog.". Though it's not wrong, > I feel like there is no consistency in the output format. > > -- A reproduce procedure > create temp table btree_bpchar (f1 text collate "C"); > create index on btree_bpchar(f1 bpchar_ops); > insert into btree_bpchar values ('foo'), ('fool'), ('bar'), ('quux'); > set enable_seqscan to false; > set enable_bitmapscan to false; > set enable_indexonlyscan to false; -- or true > explain (costs off) > select * from btree_bpchar where f1::bpchar like 'foo'; > > -- Index Scan result > QUERY PLAN > -- > Index Scan using btree_bpchar_f1_idx on btree_bpchar >Index Cond: ((f1)::bpchar = 'foo'::bpchar) >Filter: ((f1)::bpchar ~~ 'foo'::text) > (3 rows) > > -- Index Only Scan result which has 'OPERATOR' > QUERY PLAN > --- > Index Only Scan using btree_bpchar_f1_idx on btree_bpchar >Index Cond: (f1 OPERATOR(pg_catalog.=) 'foo'::bpchar)-- Here is > the point. >Filter: ((f1)::bpchar ~~ 'foo'::text) > (3 rows) > This apparently comes from generate_operator_name() in ruleutils.c, where the OPERATOR() decorator is added if: /* * The idea here is to schema-qualify only if the parser would fail to * resolve the correct operator given the unqualified op name with the * specified argtypes. */ So clearly, the code believes just the operator name could be ambiguous, so it adds the namespace too. Why exactly it is considered ambiguous I don't know, but perhaps you have other applicable operators in the search_path, or something like that? > > IIUC, the index only scan use fixed_indexquals, which is removed > "RelabelType" nodes, > for EXPLAIN so that get_rule_expr() could not understand the left argument of > the operator > (f1 if the above case) can be displayed with arg::resulttype and it doesn't > need to > show "OPERATOR(pg_catalog.)". > > I've attached PoC patch to show a simple solution. It just adds a new member > "indexqualorig" > to the index only scan node like the index scan and the bitmap index scan. > But, since I'm > a beginner about the planner, I might have misunderstood something or there > should be better > ways. > I honestly don't know if this is the correct solution. It seems to me handling this at the EXPLAIN level might just mask the issue - it's not clear to me why adding "indexqualorig" would remove the ambiguity (if there's one). Perhaps it might be better to find why the ruleutils.c code thinks the OPERATOR() is necessary, and then improve/fix that? regards -- Tomas Vondra EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: Parallel CREATE INDEX for GIN indexes
On 7/8/24 11:45, Matthias van de Meent wrote: > On Sun, 7 Jul 2024, 18:26 Tomas Vondra, wrote: >> >> On 7/5/24 21:45, Matthias van de Meent wrote: >>> On Wed, 3 Jul 2024 at 20:36, Matthias van de Meent >>> wrote: On Mon, 24 Jun 2024 at 02:58, Tomas Vondra wrote: > So I've switched this to use the regular data-type comparisons, with > SortSupport etc. There's a bit more cleanup remaining and testing > needed, but I'm not aware of any bugs. --- > +++ b/src/backend/utils/sort/tuplesortvariants.c I was thinking some more about merging tuples inside the tuplesort. I realized that this could be implemented by allowing buffering of tuple writes in writetup. This would require adding a flush operation at the end of mergeonerun to store the final unflushed tuple on the tape, but that shouldn't be too expensive. This buffering, when implemented through e.g. a GinBuffer in TuplesortPublic->arg, could allow us to merge the TID lists of same-valued GIN tuples while they're getting stored and re-sorted, thus reducing the temporary space usage of the tuplesort by some amount with limited overhead for other non-deduplicating tuplesorts. I've not yet spent the time to get this to work though, but I'm fairly sure it'd use less temporary space than the current approach with the 2 tuplesorts, and could have lower overall CPU overhead as well because the number of sortable items gets reduced much earlier in the process. >>> >>> I've now spent some time on this. Attached the original patchset, plus >>> 2 incremental patches, the first of which implement the above design >>> (patch no. 8). >>> >>> Local tests show it's significantly faster: for the below test case >>> I've seen reindex time reduced from 777455ms to 626217ms, or ~20% >>> improvement. >>> >>> After applying the 'reduce the size of GinTuple' patch, index creation >>> time is down to 551514ms, or about 29% faster total. This all was >>> tested with a fresh stock postgres configuration. >>> >>> """ >>> CREATE UNLOGGED TABLE testdata >>> AS SELECT sha256(i::text::bytea)::text >>> FROM generate_series(1, 1500) i; >>> CREATE INDEX ON testdata USING gin (sha256 gin_trgm_ops); >>> """ >>> >> >> Those results look really promising. I certainly would not have expected >> such improvements - 20-30% speedup on top of the already parallel run >> seems great. I'll do a bit more testing to see how much this helps for >> the "more realistic" data sets. >> >> I can't say I 100% understand how the new stuff in tuplesortvariants.c >> works, but that's mostly because my knowledge of tuplesort internals is >> fairly limited (and I managed to survive without that until now). >> >> What makes me a bit unsure/uneasy is that this seems to "inject" custom >> code fairly deep into the tuplesort logic. I'm not quite sure if this is >> a good idea ... > > I thought it was still fairly high-level: it adds (what seems to me) a > natural extension to the pre-existing "write a tuple to the tape" API, > allowing the Tuplesort (TS) implementation to further optimize its > ordered tape writes through buffering (and combining) of tuple writes. > While it does remove the current 1:1 relation of TS tape writes to > tape reads for the GIN case, there is AFAIK no code in TS that relies > on that 1:1 relation. > > As to the GIN TS code itself; yes it's more complicated, mainly > because it uses several optimizations to reduce unnecessary > allocations and (de)serializations of GinTuples, and I'm aware of even > more such optimizations that can be added at some point. 
> > As an example: I suspect the use of GinBuffer in writetup_index_gin to > be a significant resource drain in my patch because it lacks > "freezing" in the tuplesort buffer. When no duplicate key values are > encountered, the code is nearly optimal (except for a full tuple copy > to get the data into the GinBuffer's memory context), but if more than > one GinTuple has the same key in the merge phase we deserialize both > tuple's posting lists and merge the two. I suspect that merge to be > more expensive than operating on the compressed posting lists of the > GinTuples themselves, so that's something I think could be improved. I > suspect/guess it could save another 10% in select cases, and will > definitely reduce the memory footprint of the buffer. > Another thing that can be optimized is the current approach of > inserting data into the index: I think it's kind of wasteful to > decompress and later re-compress the posting lists once we start > storing the tuples on disk. > I need to experiment with this a bit more, to better understand the behavior and pros/cons. But one thing that's not clear to me is why would this be better than simply increasing the amount of memory for the initial BuildAccumulator buffer ... Wouldn't that have pretty much the same effect? regards -- Tomas Vondra EnterpriseD
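To make the buffered-writetup idea a bit more concrete, here is a rough sketch of its shape. This is illustration only, not code from the patch set: the gin_buffer_* helpers and the gin_buffer_flush entry point are invented names, while Tuplesortstate, TuplesortPublic, LogicalTape, SortTuple and GinTuple are the types discussed above.

static void
writetup_index_gin_buffered(Tuplesortstate *state, LogicalTape *tape,
							SortTuple *stup)
{
	TuplesortPublic *base = TuplesortstateGetPublic(state);
	GinBuffer  *buffer = (GinBuffer *) base->arg;
	GinTuple   *tup = (GinTuple *) stup->tuple;

	if (gin_buffer_holds_same_key(buffer, tup))
	{
		/* equal keys: merge the TID lists instead of writing two tuples */
		gin_buffer_merge_postings(buffer, tup);
		return;
	}

	/* key changed: emit the merged tuple built so far, then start over */
	if (!gin_buffer_is_empty(buffer))
		gin_buffer_write_and_reset(buffer, state, tape);

	gin_buffer_store(buffer, tup);
}

/*
 * Called once at the end of mergeonerun() so that the last buffered group of
 * tuples is not lost.
 */
static void
gin_buffer_flush(GinBuffer *buffer, Tuplesortstate *state, LogicalTape *tape)
{
	if (!gin_buffer_is_empty(buffer))
		gin_buffer_write_and_reset(buffer, state, tape);
}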
Re: Thoughts on NBASE=100000000
On Mon, Jul 8, 2024, at 12:45, Matthias van de Meent wrote: > On Sun, 7 Jul 2024, 22:40 Joel Jacobson, wrote: >> Today, since 64-bit architectures are dominant, NBASE=1e8 seems like it would >> have been the best choice, since the square of that still fits in >> a 64-bit signed int. > > Back then 64-bit was by far not as dominant (server and consumer chips > with AMD64 ISA only got released that year after the commit), so I > don't think 1e8 would have been the best choice at that point in time. > Would be better now, yes, but not back then. Oh, grammar mistake by me! I meant to say it "would be the best choice", in line with what I wrote above: >> Last time numeric's base was changed was back in 2003, when d72f6c75038 >> changed >> it from 10 to 1. Back then, 32-bit architectures were still dominant, >> so base-1 was clearly the best choice at this time. >> Changing NBASE might seem impossible at first, due to the existing numeric >> data >> on disk, and incompatibility issues when numeric data is transferred on the >> wire. >> >> Here are some ideas on how to work around some of these: >> >> - Incrementally changing the data on disk, e.g. upon UPDATE/INSERT >> and supporting both NBASE=1e4 (16-bit) and NBASE=1e8 (32-bit) >> when reading data. > > I think that a dynamic decision would make more sense here. At low > precision, the overhead of 4+1 bytes vs 2 bytes is quite significant. > This sounds important for overall storage concerns, especially if the > padding bytes (mentioned below) are added to indicate types. Right, I agree. Another idea: It seems possible to reduce the disk space for numerics that fit into one byte, i.e. 0 <= val <= 255, which could be communicated via NUMERIC_NBYTES=1. At least the value 0 should be quite common. >> - Due to the lack of a version field in the NumericVar struct, >> we need a way to detect if a Numeric value on disk uses >> the existing NBASE=1e4, or NBASE=1e8. >> One hack I've thought about is to exploit the fact that NUMERIC_NBYTES, >> defined as: >> #define NUMERIC_NBYTES(num) (VARSIZE(num) - NUMERIC_HEADER_SIZE(num)) >> will always be divisible by two, since a NumericDigit is an int16 (2 bytes). >> The idea is then to let "NUMERIC_NBYTES divisible by three" >> indicate NBASE=1e8, at the cost of one to three extra padding bytes. > > Do you perhaps mean NUMERIC_NBYTES *not divisible by 2*, i.e. an > uneven NUMERIC_NBYTES as indicator for NBASE=1e8, rather than only > multiples of 3? Oh, yes of course! Thinko. > While I don't think this is worth implementing for general usage, it > could be worth exploring for the larger numeric values, where the > relative overhead of the larger representation is lower. Yes, I agree it's definitively seems like a win for larger numeric values. Not sure about smaller numeric values, maybe it's possible to improve upon. Regards, Joel
Re: doc: modify the comment in function libpqrcv_check_conninfo()
On 2024-07-08 15:28, Fujii Masao wrote: On 2024/07/01 18:15, Jelte Fennema-Nio wrote: On Thu, 27 Jun 2024 at 12:27, ikedarintarof wrote: Thanks for your suggestion. I used ChatGPT to choose the wording, but it's still difficult for me. Looks good to me now (but obviously biased since you took my wording). LGTM, too. * Validate connection info string, and determine whether it might cause * local filesystem access to be attempted. The later part of the above comment for libpqrcv_check_conninfo() seems unclear. My understanding is that if must_use_password is true and this function completes without error, the password must be in the connection string, so there's no need to read the .pgpass file (i.e., no local filesystem access). That part seems to be trying to explaing something like that. Is this correct? Anyway, wouldn't it be better to either remove this unclear part or rephrase it for clarity? Regards, Thank you for your comment. I agree "local filesystem access" is vague. I think the reference to .pgpass (local file system) is not necessary in the comment for libpqrcv_check_conninfo() because it is explained in libpqrcv_connect() as shown below. /* * Re-validate connection string. The validation already happened at DDL * time, but the subscription owner may have changed. If we don't recheck * with the correct must_use_password, it's possible that the connection * will obtain the password from a different source, such as PGPASSFILE or * PGPASSWORD. */ libpqrcv_check_conninfo(conninfo, must_use_password); I remove the unclear part from the previous patch and add some explanation for later part of the function. Or, I think it is also good to make the comment more simple: * The function checks that * 1. connection info string is properly parsed and * 2. a password is provided if must_use_password is true. * If the check is failed, the it will raise an error. */ static void libpqrcv_check_conninfo(const char *conninfo, bool must_use_password) Regards, Rintaro IkedaFrom 514bd81f7d1b7104e6d2d92cb072a84096cd3885 Mon Sep 17 00:00:00 2001 From: Rintaro Ikeda <51394766+rinik...@users.noreply.github.com> Date: Thu, 27 Jun 2024 16:04:08 +0900 Subject: [PATCH v2] modify the commnet in libpqrcv_check_conninfo() --- .../replication/libpqwalreceiver/libpqwalreceiver.c| 10 ++ 1 file changed, 6 insertions(+), 4 deletions(-) diff --git a/src/backend/replication/libpqwalreceiver/libpqwalreceiver.c b/src/backend/replication/libpqwalreceiver/libpqwalreceiver.c index 02f12f2921..e6a4e4cc1c 100644 --- a/src/backend/replication/libpqwalreceiver/libpqwalreceiver.c +++ b/src/backend/replication/libpqwalreceiver/libpqwalreceiver.c @@ -305,12 +305,14 @@ bad_connection: } /* - * Validate connection info string, and determine whether it might cause - * local filesystem access to be attempted. + * The function + * 1. validates connection info string and + * 2. checks a password is provided if must_use_password is true. * * If the connection string can't be parsed, this function will raise - * an error and will not return. If it can, it will return true if this - * connection string specifies a password and false otherwise. + * an error. If must_use_password is true, the function raises an error + * if no password is provided in the connection string. In any other case + * it successfully completes. */ static void libpqrcv_check_conninfo(const char *conninfo, bool must_use_password) -- 2.39.3 (Apple Git-146)
Re: Doc Rework: Section 9.16.13 SQL/JSON Query Functions
On Fri, Jul 5, 2024 at 10:16 PM Erik Rijkers wrote: > Op 7/5/24 om 14:35 schreef Amit Langote: > > Hi Jian, > > > > Thanks for the reviews. > > > > [v3-0001-SQL-JSON-Various-improvements-to-SQL-JSON-query-f.patch] >i.e., from the patch for doc/src/sgml/func.sgml > > > Small changes: > > 4x: > 'a SQL' should be > 'an SQL' > ('a SQL' does never occur in the docs; it's always 'an SQL'; apperently > the 'sequel' pronunciation is out) > > 'some other type to which can be successfully coerced' > 'some other type to which it can be successfully coerced' > > > 'specifies the behavior behavior' > 'specifies the behavior' > > > In the following sentence: > > "By default, the result is returned as a value of type jsonb, > though the RETURNING clause can be used to return > the original jsonb value as some other type to which it > can be successfully coerced." > > it seems to me that this phrase is better removed: > "the original jsonb value as" Thanks, I've addressed all these in the next patch I'll send. -- Thanks, Amit Langote
Re: Slow catchup of 2PC (twophase) transactions on replica in LR
On Mon, Jul 8, 2024 at 12:34 PM Hayato Kuroda (Fujitsu) wrote: > > > Another possible problem is related to my use case. I haven't reproduced > > this > > case, just some thoughts. I guess, when two_phase is ON, the PREPARE > > statement > > may be truncated from the WAL at checkpoint, but COMMIT PREPARED is still > > kept > > in the WAL. On catchup, I would ask the master to send transactions from > > some > > restart LSN. I would like to get all such transactions completely, with > > their > > bodies, not only COMMIT PREPARED messages. > > I don't think it is a real issue. WALs for prepared transactions will be retained > until they are committed/aborted. > When the two_phase is on and transactions are PREPAREd, they will not be > cleaned up from the memory (See ReorderBufferProcessTXN()). Then, > RUNNING_XACT > record leads to updating the restart_lsn of the slot, but it cannot be moved > forward > because ReorderBufferGetOldestTXN() returns the prepared transaction (See > SnapBuildProcessRunningXacts()). restart_decoding_lsn of each transaction, > which > is a candidate for the restart_lsn of the slot, is always behind the startpoint of > its txn. > I see that in 0003/0004, the patch first aborts pending prepared transactions, updates the catalog, and then changes the slot's property via walrcv_alter_slot. What if there is any ERROR (say the remote node is not reachable or there is an error while updating the catalog) after we abort the pending prepared transaction? Won't we end up with lost prepared transactions in such a case? A few other comments: = The code to handle SUBOPT_TWOPHASE_COMMIT should be after failover option handling for the sake of code symmetry. Also, the checks should be in the same order: first for slot_name, then enabled, then for PreventInTransactionBlock(); after those, we can have other checks for two_phase. If possible, we can move common checks in both failover and two_phase options into a common function. What should be the behavior if one tries to set slot_name to NONE and also tries to toggle the two_phase option? I feel both options together don't make sense because there is no use in changing two_phase for some slot which we are disassociating the subscription from. The same could be said for the failover option as well, so if we agree with some different behavior here, we can follow the same for failover option as well. -- With Regards, Amit Kapila.
Re: ssl tests fail due to TCP port conflict
Hello, 07.06.2024 17:25, Tom Lane wrote: Andrew Dunstan writes: I still think my patch to force TCP mode for the SSL test makes sense as well. +1 to both things. If that doesn't get the failure rate down to an acceptable level, we can look at the retry idea. I'd like to add that the kerberos/001_auth test also suffers from the port conflict, but slightly differently. Look for example at [1]: krb5kdc.log contains: Jul 02 09:29:41 andres-postgres-buildfarm-v5 krb5kdc[471964](info): setting up network... Jul 02 09:29:41 andres-postgres-buildfarm-v5 krb5kdc[471964](Error): Address already in use - Cannot bind server socket on 127.0.0.1.55853 Jul 02 09:29:41 andres-postgres-buildfarm-v5 krb5kdc[471964](Error): Failed setting up a UDP socket (for 127.0.0.1.55853) Jul 02 09:29:41 andres-postgres-buildfarm-v5 krb5kdc[471964](Error): Address already in use - Error setting up network As far as I can see, the port for kdc is chosen by PostgreSQL::Test::Kerberos, via PostgreSQL::Test::Cluster::get_free_port(), which checks only for TCP port availability (with can_bind()), but not for UDP, so this increases the probability of the conflict for this test (a similar failure: [2]). Although we can also find a failure with TCP: [3] (It's not clear to me, what processes can use UDP ports while testing, but maybe those buildfarm animals are running on the same logical machine simultaneously?) [1] https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=rorqual&dt=2024-07-02%2009%3A27%3A15 [2] https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=mylodon&dt=2024-05-15%2001%3A25%3A07 [3] https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=grassquit&dt=2024-07-04%2008%3A28%3A19 Best regards, Alexander
Re: Pgoutput not capturing the generated columns
On Mon, 8 Jul 2024 at 13:20, Peter Smith wrote: > > Hi Shlok, Here are some review comments for patch v15-0003. > > == > src/backend/catalog/pg_publication.c > > 1. publication_translate_columns > > The function comment says: > * Translate a list of column names to an array of attribute numbers > * and a Bitmapset with them; verify that each attribute is appropriate > * to have in a publication column list (no system or generated attributes, > * no duplicates). Additional checks with replica identity are done later; > * see pub_collist_contains_invalid_column. > > That part about "[no] generated attributes" seems to have gone stale > -- e.g. not quite correct anymore. Should it say no VIRTUAL generated > attributes? Yes, we should use VIRTUAL generated attributes, I have modified it. > == > src/backend/replication/logical/proto.c > > 2. logicalrep_write_tuple and logicalrep_write_attrs > > I thought all the code fragments like this: > > + if (att->attgenerated && att->attgenerated != ATTRIBUTE_GENERATED_STORED) > + continue; > + > > don't need to be in the code anymore, because of the BitMapSet (BMS) > processing done to make the "column list" for publication where > disallowed generated cols should already be excluded from the BMS, > right? > > So shouldn't all these be detected by the following statement: > if (!column_in_column_list(att->attnum, columns)) > continue; The current BMS logic do not handle the Virtual Generated Columns. There can be cases where we do not want a virtual generated column but it would be present in BMS. To address this I have added the above logic. I have added this logic similar to the checks of 'attr->attisdropped'. > == > src/backend/replication/pgoutput/pgoutput.c > > 4. send_relation_and_attrs > > (this is a similar comment for #2 above) > > IIUC of the advantages of the BitMapSet (BMS) idea in patch 0001 to > process the generated columns up-front means there is no need to check > them again in code like this. > > They should be discovered anyway in the subsequent check: > /* Skip this attribute if it's not present in the column list */ > if (columns != NULL && !bms_is_member(att->attnum, columns)) > continue; Same explanation as above. I have addressed all the comments in v16-0003 patch. Please refer [1]. [1]: https://www.postgresql.org/message-id/CANhcyEXw%3DBFFVUqohWES9EPkdq-ZMC5QRBVQqQPzrO%3DQ7uzFQw%40mail.gmail.com Thanks and Regards, Shlok Kyal
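For readers following along, the combined per-attribute filtering being discussed looks roughly like this; it is a sketch assembled from the fragments above, with desc and columns standing in for the relation's tuple descriptor and the publication column list:

for (int i = 0; i < desc->natts; i++)
{
	Form_pg_attribute att = TupleDescAttr(desc, i);

	if (att->attisdropped)
		continue;

	/* virtual generated columns are never published */
	if (att->attgenerated && att->attgenerated != ATTRIBUTE_GENERATED_STORED)
		continue;

	/* stored generated and ordinary columns still obey the column list */
	if (columns != NULL && !bms_is_member(att->attnum, columns))
		continue;

	/* ... send this attribute ... */
}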
RE: Partial aggregates pushdown
Hi Jelte, Thank you for comments and advises. > From: Jelte Fennema-Nio > Sent: Monday, July 8, 2024 5:31 PM > On Sun, 7 Jul 2024 at 23:46, fujii.y...@df.mitsubishielectric.co.jp > wrote: > > In my mind, Approach1 is superior. Therefore, if there are no objections > > this week, I plan to resume implementing > Approach1 next week. I would appreciate it if anyone could discuss the topic > with me or ask questions. > > Honestly, the more I think about this, the more I like Approach2. Not because > I disagree with you about some of the > limitations of Approach2, but because I'd rather see those limitations fixed > in CREATE TYPE, instead of working around > these limitations in CREATE AGGREGATE. That way more usages can benefit. > Detailed explanation and arguments below. Firstly, I may have jumped to conclusions too quickly. I apologize that. I would appreciate it if we clarify Approach 1 and Approach 2 more precisely and we can proceed with the discussion. Before we get into the details, let me break down the main differences between Approach 1 and Approach 2. The best thing about Approach2 is that it lets us send state values using the existing data type system. I'm worried that if we choose Approach2, we might face some limits because we have to create new types. But, we might be able to fix these limits if we look into it more. Approach1 doesn't make new types, so we can avoid these limits. But, it means we have to make export/import functions that are similar to the typsend/typreceive functions. So, we need to make sure if we really need this method. Is this the right understanding? > From: Jelte Fennema-Nio > Sent: Monday, July 8, 2024 5:59 PM > On Sun, 30 Jun 2024 at 23:42, fujii.y...@df.mitsubishielectric.co.jp > wrote: > > Instead, I can make PARTIAL_AGGREGATE an unreserved word by placing it > > after the FILTER clause, like avg(c1) FILTER > (WHERE c2 > 0) PARTIAL_AGGREGATE, and by marking it as an ASLABEL word like > FILTER. > > I attached the patch of the method. > > If there are no objections, I would like to proceed with the method > > described above. > > I'd appreciate it if anyone comment the method. > > I like this approach of using PARTIAL_AGGREGATE in the same place as the > FILTER clause. Thank you for comment. > On Sun, 7 Jul 2024 at 23:52, fujii.y...@df.mitsubishielectric.co.jp > wrote: > > My plan for advancing the patch involves the following steps: > > Step1. Decide the approach on transmitting state value. > > Step2. Implement code (including comments) and tests to support a subset > > of aggregate functions. > > Specifically, I plan to support avg, sum, and other aggregate functions > > like min and max which don't need > export/import functions. > > Step3. Add documentations. > > > > To clarify my approach, should I proceed with Step 3 before Step2? > > (my opinion, Bruce might have a different one) > > I think it's good that you split the original patch in two: > 0001: non-internal partial aggregates > 0002: internal partial aggregates > > I think we're aligned on the general design of 0001. So I think now is > definitely the time to include documentation there, so > we can discuss this patch in more detail, and move it forward. > > I think generally for 0002 it would also be useful to have documentation, I > personally like reading it to understand the > general design and then comparing that to the code. But I also understand > that the language differences between Japanese > and English, makes writing such docs a significant effort for you. 
> So I think > it would be fine to skip docs for 0002 for now > until we decide on the approach we want to take for internal partial > aggregates. At least for 0001, it seems like it would be a good idea to attach a document at this stage. Best regards, Yuki Fujii -- Yuki Fujii Information Technology R&D Center, Mitsubishi Electric Corporation
Re: Improving the latch handling between logical replication launcher and worker processes.
On Fri, 5 Jul 2024 at 18:38, Heikki Linnakangas wrote: > > On 05/07/2024 14:07, vignesh C wrote: > > On Thu, 4 Jul 2024 at 16:52, Heikki Linnakangas wrote: > >> > >> I'm don't quite understand the problem we're trying to fix: > >> > >>> Currently the launcher's latch is used for the following: a) worker > >>> process attach b) worker process exit and c) subscription creation. > >>> Since this same latch is used for multiple cases, the launcher process > >>> is not able to handle concurrent scenarios like: a) Launcher started a > >>> new apply worker and waiting for apply worker to attach and b) create > >>> subscription sub2 sending launcher wake up signal. In this scenario, > >>> both of them will set latch of the launcher process, the launcher > >>> process is not able to identify that both operations have occurred 1) > >>> worker is attached 2) subscription is created and apply worker should > >>> be started. As a result the apply worker does not get started for the > >>> new subscription created immediately and gets started after the > >>> timeout of 180 seconds. > >> > >> I don't see how we could miss a notification. Yes, both events will set > >> the same latch. Why is that a problem? > > > > The launcher process waits for the apply worker to attach at > > WaitForReplicationWorkerAttach function. The launcher's latch is > > getting set concurrently for another create subscription and apply > > worker attached. The launcher now detects the latch is set while > > waiting at WaitForReplicationWorkerAttach, it resets the latch and > > proceed to the main loop and waits for DEFAULT_NAPTIME_PER_CYCLE(as > > the latch has already been reset). Further details are provided below. > > > > The loop will see that the new > >> worker has attached, and that the new subscription has been created, and > >> process both events. Right? > > > > Since the latch is reset at WaitForReplicationWorkerAttach, it skips > > processing the create subscription event. > > > > Slightly detailing further: > > In the scenario when we execute two concurrent create subscription > > commands, first CREATE SUBSCRIPTION sub1, followed immediately by > > CREATE SUBSCRIPTION sub2. > > In a few random scenarios, the following events may unfold: > > After the first create subscription command(sub1), the backend will > > set the launcher's latch because of ApplyLauncherWakeupAtCommit. > > Subsequently, the launcher process will reset the latch and identify > > the addition of a new subscription in the pg_subscription list. The > > launcher process will proceed to request the postmaster to start the > > apply worker background process (sub1) and request the postmaster to > > notify the launcher once the apply worker(sub1) has been started. > > Launcher will then wait for the apply worker(sub1) to attach in the > > WaitForReplicationWorkerAttach function. > > Meanwhile, the second CREATE SUBSCRIPTION command (sub2) which was > > executed concurrently, also set the launcher's latch(because of > > ApplyLauncherWakeupAtCommit). > > Simultaneously when the launcher remains in the > > WaitForReplicationWorkerAttach function waiting for apply worker of > > sub1 to start, the postmaster will start the apply worker for > > subscription sub1 and send a SIGUSR1 signal to the launcher process > > via ReportBackgroundWorkerPID. Upon receiving this signal, the > > launcher process will then set its latch. 
> > > > Now, the launcher's latch has been set concurrently after the apply > > worker for sub1 is started and the execution of the CREATE > > SUBSCRIPTION sub2 command. > > > > At this juncture, the launcher, which had been awaiting the attachment > > of the apply worker, detects that the latch is set and proceeds to > > reset it. The reset operation of the latch occurs within the following > > section of code in WaitForReplicationWorkerAttach: > > ... > > rc = WaitLatch(MyLatch, > > WL_LATCH_SET | WL_TIMEOUT | WL_EXIT_ON_PM_DEATH, > > 10L, WAIT_EVENT_BGWORKER_STARTUP); > > > > if (rc & WL_LATCH_SET) > > { > > ResetLatch(MyLatch); > > CHECK_FOR_INTERRUPTS(); > > } > > ... > > > > After resetting the latch here, the activation signal intended to > > start the apply worker for subscription sub2 is no longer present. The > > launcher will return to the ApplyLauncherMain function, where it will > > await the DEFAULT_NAPTIME_PER_CYCLE, which is 180 seconds, before > > processing the create subscription request (i.e., creating a new apply > > worker for sub2). > > The issue arises from the latch being reset in > > WaitForReplicationWorkerAttach, which can occasionally delay the > > synchronization of table data for the subscription. > > Ok, I see it now. Thanks for the explanation. So the problem isn't using > the same latch for different purposes per se. It's that we're trying to > use it in a nested fashion, resetting it in the inner loop. > > Looking at the proposed patch more closely: > > > @@ -221,13 +224,13 @@ WaitForReplicationWorkerAttach(Logi
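To illustrate the general shape of a fix for this kind of race (a sketch only, not the attached patch; check_worker_attached() is a stand-in for the real attach test): a nested wait that has to reset the shared latch can remember that it consumed a wakeup and re-arm the latch before returning, so the outer loop still notices the concurrent CREATE SUBSCRIPTION without waiting for DEFAULT_NAPTIME_PER_CYCLE.

static void
wait_for_attach_sketch(void)
{
	bool		latch_was_set = false;

	while (!check_worker_attached())
	{
		int			rc;

		rc = WaitLatch(MyLatch,
					   WL_LATCH_SET | WL_TIMEOUT | WL_EXIT_ON_PM_DEATH,
					   10L, WAIT_EVENT_BGWORKER_STARTUP);

		if (rc & WL_LATCH_SET)
		{
			ResetLatch(MyLatch);
			latch_was_set = true;	/* remember the consumed wakeup */
			CHECK_FOR_INTERRUPTS();
		}
	}

	/* hand the wakeup back to the launcher's main loop */
	if (latch_was_set)
		SetLatch(MyLatch);
}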
Re: Partial aggregates pushdown
On Mon, 8 Jul 2024 at 14:12, fujii.y...@df.mitsubishielectric.co.jp wrote: > The best thing about Approach2 is that it lets us send state values using the > existing data type system. > I'm worried that if we choose Approach2, we might face some limits because we > have to create new types. > But, we might be able to fix these limits if we look into it more. > > Approach1 doesn't make new types, so we can avoid these limits. > But, it means we have to make export/import functions that are similar to the > typsend/typreceive functions. > So, we need to make sure if we really need this method. > > Is this the right understanding? Yeah, correct. To clarify my reasoning a bit more: IMHO, the main downside of implementing Approach1 is that we then end up with two different mechanisms to "take data from memory and serialize it in a way in which it can be sent over the network". I'd very much prefer if we could have a single system responsible for that task. So if there are issues with the current system (e.g. having to implement typinput/typoutput), then I'd rather address these problems in the existing system. Only if that turns out to be impossible for some reason would I prefer Approach1. Personally, even if Approach2 requires a bit more code, I'd still prefer a single serialization system over having two serialization systems.
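To make the trade-off a bit more concrete: what Approach2 amounts to is giving the transition state a real type with its own binary I/O, so the existing wire-format machinery can carry it between nodes. A minimal sketch for something like avg(float8) follows; the AvgState struct and the function names are invented for illustration, while the pq_* routines are the existing pqformat ones.

typedef struct AvgState
{
	int64		count;
	float8		sum;
} AvgState;

Datum
avg_state_send(PG_FUNCTION_ARGS)
{
	AvgState   *state = (AvgState *) PG_GETARG_POINTER(0);
	StringInfoData buf;

	pq_begintypsend(&buf);
	pq_sendint64(&buf, state->count);
	pq_sendfloat8(&buf, state->sum);
	PG_RETURN_BYTEA_P(pq_endtypsend(&buf));
}

Datum
avg_state_recv(PG_FUNCTION_ARGS)
{
	StringInfo	buf = (StringInfo) PG_GETARG_POINTER(0);
	AvgState   *state = (AvgState *) palloc(sizeof(AvgState));

	state->count = pq_getmsgint64(buf);
	state->sum = pq_getmsgfloat8(buf);
	PG_RETURN_POINTER(state);
}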
Re: Improving the latch handling between logical replication launcher and worker processes.
On Mon, Jul 8, 2024 at 5:47 PM vignesh C wrote: > > On Fri, 5 Jul 2024 at 18:38, Heikki Linnakangas wrote: > > > > > As an alternative, smaller fix, I think we could do the attached. It > > forces the launcher's main loop to do another iteration, if it calls > > logicalrep_worker_launch(). That extra iteration should pick up any > > missed notifications. > > This also works. > The minor drawback would be that in many cases the extra iteration would not lead to anything meaningful. -- With Regards, Amit Kapila.
Re: Wrong security context for deferred triggers?
On 7/8/24 04:14, Joseph Koshakow wrote: Given the above and the fact that the patch is a breaking change, my vote would still be to keep the current behavior and update the documentation. Though I'd be happy to be overruled by someone with more knowledge of triggers. Thanks for that feedback. Based on that, the patch should be rejected. Since there were a couple of other opinions early in the thread, I'll let it sit like that for now, and judgement can be passed at the end of the commitfest. Perhaps somebody else wants to chime in. Yours, Laurenz Albe
Re: Is it expected behavior index only scan shows "OPERATOR(pg_catalog." for EXPLAIN?
Tomas Vondra writes: > I honestly don't know if this is the correct solution. It seems to me > handling this at the EXPLAIN level might just mask the issue - it's not > clear to me why adding "indexqualorig" would remove the ambiguity (if > there's one). Perhaps it might be better to find why the ruleutils.c > code thinks the OPERATOR() is necessary, and then improve/fix that? As Masahiro-san already said, the root cause is that the planner removes the RelabelType that is on the indexed variable. So ruleutils sees that the operator is not the one that would be chosen by the parser given those input expressions (cf generate_operator_name) and it concludes it'd better schema-qualify the operator. While that doesn't really make any difference in this particular case, it is the right thing in typical rule-deparsing cases. I don't think this output is really wrong, and I'm not in favor of adding nontrivial overhead to make it better, so I don't like the proposed patch. Maybe generate_operator_name could use some other heuristic, but I'm unsure what. The case it wants to cover is that the operator is in the search path but is masked by some operator in an earlier schema, so simply testing if the operator's schema is in the path at all would be insufficient. regards, tom lane
Re: MergeJoin beats HashJoin in the case of multiple hash clauses
On 3/11/2023 23:43, Tomas Vondra wrote: On 9/11/23 10:04, Lepikhov Andrei wrote: * Determine bucketsize fraction and MCV frequency for the inner * relation. We use the smallest bucketsize or MCV frequency estimated * for any individual hashclause; this is undoubtedly conservative. I'm sure this may lead to inflated cost for "good" cases (where the actual bucket size really is a product), which may push the optimizer to use the less efficient/slower join method. Yes, It was contradictory idea, though. IMHO the only principled way forward is to get a better ndistinct estimate (which this implicitly does), perhaps by using extended statistics. I haven't tried, but I guess it'd need to extract the clauses for the inner side, and call estimate_num_groups() on it. And I've done it. Sorry for so long response. This patch employs of extended statistics for estimation of the HashJoin bucket_size. In addition, I describe the idea in more convenient form here [1]. Obviously, it needs the only ndistinct to make a prediction that allows to reduce computational cost of this statistic. [1] https://open.substack.com/pub/danolivo/p/why-postgresql-prefers-mergejoin?r=34q1yy&utm_campaign=post&utm_medium=web -- regards, Andrei Lepikhov From b4b007970b1d9b99602b8422f2122c8e5738828e Mon Sep 17 00:00:00 2001 From: "Andrei V. Lepikhov" Date: Mon, 13 May 2024 16:15:02 +0700 Subject: [PATCH] Use extended statistics for precise estimation of bucket size in hash join. Recognizing the real-life complexity where columns in the table often have functional dependencies, PostgreSQL's estimation of the number of distinct values over a set of columns can be underestimated (or much rarely, overestimated) when dealing with multi-clause JOIN. In the case of hash join, it can end up with a small number of predicted hash buckets and, as a result, picking non-optimal merge join. To improve the situation, we introduce one additional stage of bucket size estimation - having two or more join clauses estimator lookup for extended statistics and use it for multicolumn estimation. Clauses are grouped into lists, each containing expressions referencing the same relation. The result of the multicolumn estimation made over such a list is combined with others according to the caller's logic. Clauses that are not estimated are returned to the caller for further estimation. --- src/backend/optimizer/path/costsize.c | 12 +- src/backend/utils/adt/selfuncs.c | 171 ++ src/include/utils/selfuncs.h | 4 + 3 files changed, 186 insertions(+), 1 deletion(-) diff --git a/src/backend/optimizer/path/costsize.c b/src/backend/optimizer/path/costsize.c index ee23ed7835..eafde90d4c 100644 --- a/src/backend/optimizer/path/costsize.c +++ b/src/backend/optimizer/path/costsize.c @@ -4250,9 +4250,19 @@ final_cost_hashjoin(PlannerInfo *root, HashPath *path, } else { + List *otherclauses; + innerbucketsize = 1.0; innermcvfreq = 1.0; - foreach(hcl, hashclauses) + + /* At first, try to estimate bucket size using extended statistics. 
*/ + otherclauses = estimate_multivariate_bucketsize(root, + inner_path->parent, + hashclauses, + &innerbucketsize); + + /* Pass through the remaining clauses */ + foreach(hcl, otherclauses) { RestrictInfo *restrictinfo = lfirst_node(RestrictInfo, hcl); Selectivity thisbucketsize; diff --git a/src/backend/utils/adt/selfuncs.c b/src/backend/utils/adt/selfuncs.c index 5f5d7959d8..89ed2d136a 100644 --- a/src/backend/utils/adt/selfuncs.c +++ b/src/backend/utils/adt/selfuncs.c @@ -3751,6 +3751,177 @@ estimate_num_groups(PlannerInfo *root, List *groupExprs, double input_rows, return numdistinct; } +/* + * Try to estimate the bucket size of the hash join inner side when the join + * condition contains two or more clauses by employing extended statistics. + * + * The main idea of this approach is that the distinct value generated by + * multivariate estimation on two or more columns would provide less bucket size + * than estimation on one separate column. + * + * IMPORTANT: It is crucial to synchronise the approach of combining different + * estimations with the caller's method. + * Return a list of clauses that didn't fetch any extended statistics. + */ +List * +estimate_multivariate_bucketsize(PlannerInfo *root, RelOptInfo *inner, +List *hashclauses, +
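As a made-up example of the estimation gap this targets: for a two-clause join condition on independent inner columns a and b with ndistinct(a) = ndistinct(b) = 1000, the existing "smallest per-clause bucketsize" rule yields innerbucketsize = min(1/1000, 1/1000) = 1/1000, while the real number of groups is about 1000 * 1000 = 1,000,000, i.e. a true bucket fraction near 1/1,000,000. A multivariate ndistinct statistic on (a, b) lets estimate_num_groups() report that group count directly, so the hash join is no longer costed with a bucket size inflated by three orders of magnitude.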
Re: Doc Rework: Section 9.16.13 SQL/JSON Query Functions
Thanks for the readthrough. On Sat, Jul 6, 2024 at 11:56 AM jian he wrote: > json_exists > "Returns true if the SQL/JSON path_expression applied to the > context_item using the PASSING values yields any items." > now you changed to > << > Returns true if the SQL/JSON path_expression applied to the > context_item. path_expression can reference variables named in the > PASSING clause. > << > Is it wrong? Yes, I mistakenly dropped "doesn't yield any items." > For both ON EMPTYand ON ERROR, > specifying ERROR will cause an error to be > thrown with > the appropriate message. Other options include returning a SQL NULL, > an > need one more whitespace. should be: > For both ON EMPTY and ON ERROR, Fixed. > Note that an error will be thrown if multiple values are returned and > WITHOUT WRAPPER is specified. > since not specify error on error, then no error will be thrown. maybe > rephrase to > It will be evaulated as error if multiple values are returned and > WITHOUT WRAPPER is specified. Done. > > For both ON EMPTY and ON ERROR, > specifying ERROR will cause an error to be > thrown with > the appropriate message. Other options include returning a SQL NULL, > an > empty array or object (array by default), or a user-specified > expression > that can be coerced to jsonb or the type specified in > RETURNING. > The default when ON EMPTY or ON > ERROR > is not specified is to return a SQL NULL value when the respective > situation occurs. > > in here, "empty array or object (array by default)", > I don't think people can understand the meaning of "(array by default)" . > > "or a user-specified expression" > maybe we can change to > "or a user-specified > DEFAULT expression" > I think "user-specified expression" didn't have much link with > "DEFAULT expression" Yes, specifying each option's syntax in parenthesis makes sense. > path_expression can reference variables named > in the PASSING clause. > do we need "The path_expression"? Fixed. > also maybe we can add > + In PASSING clause, varname is > +the variables name, value is the > + variables' value. Instead of expanding the description of the PASSING clause in each function's description, I've moved its description to the top paragraph with slightly different text. > we can add a PASSING clause example from sqljson_queryfuncs.sql , > since all three functions can use it. Done. > JSON_VALUE: > < of ON ERROR below). > then > << The ON ERROR and ON EMPTY clauses have similar semantics as > mentioned in the description of JSON_QUERY, except the set of values > returned in lieu of throwing an error is different. > > you first refer "below", then director to JSON_QUERY on error, on > empty description. > is the correct usage of "below"? > "(though see the discussion of ON ERROR below)." > i am not sure the meaning of "though" even watched this > https://www.youtube.com/watch?v=r-LphuCKQ0Q I've replaced the sentence with "(though see the discussion of ON ERROR below)" with this: "getting multiple values will be treated as an error." No need to reference the ON ERROR clause with that wording. > + context_item can be any character string that > + can be succesfully cast to jsonb. > > typo: "succesfully", should be "successfully" Fixed. > maybe rephrase it to: > + context_item can be jsonb type or any > character string that > + can be successfully cast to jsonb. Done. > +ON EMPTY expression (that is caused by empty > result > +of path_expressionevaluation). > need extra white space, should be > +of path_expression evaluation). Fixed. 
> +The default when ON EMPTY or ON > ERROR > +is not specified is to return a SQL NULL value when the respective > +situation occurs. > Correct me if I'm wrong. > we can just say: > +The default when ON EMPTY or ON > ERROR > +is not specified is to return an SQL NULL value. Agreed. > another tiny issue. > > -select json_query(jsonb '{"a": "[1, 2]"}', 'lax $.a' > OMIT QUOTES); > +JSON_QUERY(jsonb '{"a": "[1, 2]"}', 'lax $.a' OMIT > QUOTES); > [1, 2] > > > -select json_query(jsonb '{"a": "[1, 2]"}', 'lax $.a' > RETURNING int[] OMIT QUOTES ERROR ON ERROR); > +JSON_QUERY(jsonb '{"a": "[1, 2]"}', 'lax $.a' > RETURNING int[] OMIT QUOTES ERROR ON ERROR); > > These two example queries don't need semicolons at the end? Fixed. Also, I've removed the following sentence in the description of JSON_EXISTS, because a) it seems out of place - -Note that if the path_expression is -strict and ON ERROR behavior is -ERROR, an error is generated if it yields no items. - and b) does not seem correct: SELECT JSON_EXISTS(jsonb '{"key1": [1,2,3]}', 'strict $.key1[*] ? (@ > 3)' ERROR ON ERROR); json_ex
Re: a potential typo in comments of pg_parse_json
On Mon, Jul 8, 2024 at 5:25 PM Junwang Zhao wrote: > Not 100% sure, sorry if this doesn't make sense. > > --- a/src/common/jsonapi.c > +++ b/src/common/jsonapi.c > @@ -514,7 +514,7 @@ freeJsonLexContext(JsonLexContext *lex) > * > * If FORCE_JSON_PSTACK is defined then the routine will call the > non-recursive > * JSON parser. This is a useful way to validate that it's doing the right > - * think at least for non-incremental cases. If this is on we expect to see > + * thing at least for non-incremental cases. If this is on we expect to see > * regression diffs relating to error messages about stack depth, but no > * other differences. > */ Good catch. Fixed. -- Thanks, Amit Langote
Re: Injection point locking
On 25/06/2024 05:25, Noah Misch wrote: On Mon, Jun 24, 2024 at 11:03:09AM -0400, Tom Lane wrote: Heikki Linnakangas writes: ... I can't do that, because InjectionPointRun() requires a PGPROC entry, because it uses an LWLock. That also makes it impossible to use injection points in the postmaster. Any chance we could allow injection points to be triggered without a PGPROC entry? Could we use a simple spinlock instead? That sounds fine to me. If calling hash_search() with a spinlock feels too awful, a list to linear-search could work. With a fast path for the case that no injection points are attached or something? Even taking a spinlock in the postmaster is contrary to project policy. Maybe we could look the other way for debug-only code, but it seems like a pretty horrible precedent. If you're actually using an injection point in the postmaster, that would be the least of concerns. It is something of a concern for running an injection point build while not attaching any injection point. One solution could be a GUC to control whether the postmaster participates in injection points. Another could be to make the data structure readable with atomics only. I came up with the attached. It replaces the shmem hash table with an array that's scanned linearly. On each slot in the array, there's a generation number that indicates whether the slot is in use, and allows detecting concurrent modifications without locks. The attach/detach operations still hold the LWLock, but InjectionPointRun() is now lock-free, so it can be used without a PGPROC entry. It's now usable from postmaster too. However, it's theoretically possible that if shared memory is overwritten with garbage, the garbage looks like a valid injection point with a name that matches one of the injection points that postmaster looks at. That seems unlikely enough that I think we can accept the risk. To close that gap 100% I think a GUC is the only solution. Note that until we actually add an injection point to a function that runs in the postmaster, there's no risk. If we're uneasy about that, we could add an assertion to InjectionPointRun() to prevent it from running in the postmaster, so that we don't cross that line inadvertently. Thoughts? -- Heikki Linnakangas Neon (https://neon.tech) From ca331988b67937b942db5c55770919306931d1ac Mon Sep 17 00:00:00 2001 From: Heikki Linnakangas Date: Mon, 8 Jul 2024 16:09:46 +0300 Subject: [PATCH 1/1] Use atomics to avoid locking InjectionPointRun() --- src/backend/utils/misc/injection_point.c | 359 +++ src/tools/pgindent/typedefs.list | 1 + 2 files changed, 243 insertions(+), 117 deletions(-) diff --git a/src/backend/utils/misc/injection_point.c b/src/backend/utils/misc/injection_point.c index 48f29e9b60..c5b2310532 100644 --- a/src/backend/utils/misc/injection_point.c +++ b/src/backend/utils/misc/injection_point.c @@ -21,7 +21,6 @@ #include "fmgr.h" #include "miscadmin.h" -#include "port/pg_bitutils.h" #include "storage/fd.h" #include "storage/lwlock.h" #include "storage/shmem.h" @@ -31,22 +30,35 @@ #ifdef USE_INJECTION_POINTS -/* - * Hash table for storing injection points. - * - * InjectionPointHash is used to find an injection point by name. 
- */ -static HTAB *InjectionPointHash; /* find points from names */ - /* Field sizes */ #define INJ_NAME_MAXLEN 64 #define INJ_LIB_MAXLEN 128 #define INJ_FUNC_MAXLEN 128 #define INJ_PRIVATE_MAXLEN 1024 -/* Single injection point stored in InjectionPointHash */ +/* Single injection point stored in shared memory */ typedef struct InjectionPointEntry { + /* + * Because injection points need to be usable without LWLocks, we use a + * generation counter on each entry to to allow safe, lock-free reading. + * + * To read an entry, first read the current 'generation' value. If it's + * even, then the slot is currently unused, and odd means it's in use. + * When reading the other fields, beware that they may change while + * reading them, if the entry is released and reused! After reading the + * other fields, read 'generation' again: if its value hasn't changed, you + * can be certain that the other fields you read are valid. Otherwise, + * the slot was concurrently recycled, and you should ignore it. + * + * When adding an entry, you must store all the other fields first, and + * then update the generation number, with an appropriate memory barrier + * in between. In addition to that protocol, you must also hold + * InjectionPointLock, to protect two backends modifying the array at the + * same time. + */ + pg_atomic_uint64 generation; + char name[INJ_NAME_MAXLEN]; /* hash key */ char library[INJ_LIB_MAXLEN]; /* library */ char function[INJ_FUNC_MAXLEN]; /* function */ @@ -58,8 +70,22 @@ typedef struct InjectionPointEntry char private_data[INJ_PRIVATE_MAXLEN]; } InjectionPointEntry; -#define INJECTION_POINT_HASH_INIT_SIZE 16 -#define INJECTION_POINT_
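To spell out the reader side of the generation protocol described in that comment (a sketch only; the helper name and exact structure differ from the patch):

static bool
injection_point_read_entry(InjectionPointEntry *entry, InjectionPointEntry *copy)
{
	uint64		gen_before;
	uint64		gen_after;

	gen_before = pg_atomic_read_u64(&entry->generation);
	if ((gen_before & 1) == 0)
		return false;			/* even generation: slot currently unused */

	pg_read_barrier();

	/* copy out the fields we need (private_data omitted for brevity) */
	memcpy(copy->name, entry->name, INJ_NAME_MAXLEN);
	memcpy(copy->library, entry->library, INJ_LIB_MAXLEN);
	memcpy(copy->function, entry->function, INJ_FUNC_MAXLEN);

	pg_read_barrier();
	gen_after = pg_atomic_read_u64(&entry->generation);

	/* if the slot was recycled while we were reading, discard the copy */
	return gen_before == gen_after;
}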
Re: pg_wal_summary_contents() and pg_walsummary may return different results on the same WAL summary file
On Thu, Jul 4, 2024 at 6:16 AM Fujii Masao wrote: > Yes, so I updated the commit message. I borrowed your description and used it > in the message. Attached is the revised version of the patch. > > If there are no objections, I will commit and backpatch it. +1. Maybe change "Fix bugs in pg_wal_summary_contents()" to "Fix limit block handling in pg_wal_summary_contents()". -- Robert Haas EDB: http://www.enterprisedb.com
Re: Pluggable cumulative statistics
Hi, On Mon, Jul 08, 2024 at 07:22:32AM +, Bertrand Drouvot wrote: > Except the above (which is just a Nit), 0001 LGTM. > Looking at 0002: It looks pretty straightforward, just one comment: + ptr = ((char *) ctl) + kind_info->shared_ctl_off; + kind_info->init_shmem_cb((void *) ptr); I don't think we need to cast ptr to void when calling init_shmem_cb(). Looking at some examples in the code, it does not look like we cast the argument to void when a function has (void *) as parameter (also there is examples in 0003 where it's not done, see next comments for 0003). So I think removing the cast here would be more consistent. Looking at 0003: It looks pretty straightforward. Also for example, here: + fputc(PGSTAT_FILE_ENTRY_FIXED, fpout); + write_chunk_s(fpout, &kind); write_chunk(fpout, ptr, info->shared_data_len); ptr is not casted to void when calling write_chunk() while its second parameter is a "void *". + ptr = ((char *) shmem) + info->shared_ctl_off + + info->shared_data_off; + + if (!read_chunk(fpin, ptr, Same here for read_chunk(). I think that's perfectly fine and that we should do the same in 0002 when calling init_shmem_cb() for consistency. Regards, -- Bertrand Drouvot PostgreSQL Contributors Team RDS Open Source Databases Amazon Web Services: https://aws.amazon.com
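For what it's worth, the style point is simply that the explicit cast adds nothing once the parameter is already void *, e.g.:

	char	   *ptr = ((char *) ctl) + kind_info->shared_ctl_off;

	kind_info->init_shmem_cb(ptr);			/* implicit conversion to void * */
	kind_info->init_shmem_cb((void *) ptr); /* equivalent, just noisier */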
Re: Optimize commit performance with a large number of 'on commit delete rows' temp tables
Hi wenhui, I carefully analyzed the reason for the performance regression with fewer temporary tables in the previous patch (v1-0002-): the k_hash_funcs determined by the bloom_create function were 10(MAX_HASH_FUNCS), which led to an excessive calculation overhead for the bloom filter. Based on the calculation formula for the bloom filter, when the number of items is 100 and k_hash_funcs is 2, the false positive rate for a 1KB bloom filter is 0.0006096; when the number of items is 1000, the false positive rate is 0.048929094. Therefore, k_hash_funcs of 2 can already achieve a decent false positive rate, while effectively reducing the computational overhead of the bloom filter. I have re-implemented a bloom_create_v2 function to create a bloom filter with a specified number of hash functions and specified memory size. From the test data below, it can be seen that the new patch in the attachment (v1-0003-) does not lead to performance regression in any scenario. Furthermore, the default threshold value can be lowered to 2. Here is the TPS performance data for different numbers of temporary tables under different thresholds, as compared with the head (98347b5a). The testing tool used is pgbench, with the workload being to insert into one temporary table (when the number of temporary tables is 0, the workload is SELECT 1): |tablenum |0 |1 |2 |5 |10 |100 |1000 | |--|||---|---|---|---|---| |head(98347b5a)|39912.722209|10064.306268|9183.871298|7452.071689|5641.487369|1073.203851|114.530958 | |threshold-2 |40097.047974|10009.598155|9982.172866|9955.651235|.338901|9785.626296|8278.828828| Here is the TPS performance data for different numbers of temporary tables at a threshold of 2, compared with the head (commit 98347b5a). The testing tool is pgbench, with the workload being to insert into all temporary tables: |table num |1 |2 | 5 |10 |100 |1000 | |--|---|---|---|---|--|-| |head(98347b5a)|7243.945042|5734.545012|3627.290594|2262.594766|297.856756|27.745808| |threshold-2 |7289.171381|5740.849676|3626.135510|2207.439931|293.145036|27.020953| I have previously researched the implementation of the Global Temp Table (GTT) you mentioned, and it have been used in Alibaba Cloud's PolarDB (Link [1]). GTT can prevent truncation operations on temporary tables that have not been accessed by the current session (those not in the OnCommitItem List), but GTT that have been accessed by the current session still need to be truncated at commit time.Therefore, GTT also require the optimizations mentioned in the above patch. [1] https://www.alibabacloud.com/help/en/polardb/polardb-for-oracle/using-global-temporary-tables?spm=a3c0i.23458820.2359477120.1.66e16e9bUpV7cK Best Regards, Fei Changhong v1-0003-Optimize-commit-with-temp-tables-guc.patch Description: Binary data v1-0003-Optimize-commit-with-temp-tables-without-guc.patch Description: Binary data
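For readers who want to check the false-positive figures quoted above, they follow from the standard Bloom filter approximation p = (1 - e^(-k*n/m))^k; the quoted numbers match a 1KB filter counted as m = 8000 bits. The small, self-contained calculation below is illustrative only (bloom_fpr is just a helper name, not part of the patch):

    #include <math.h>
    #include <stdio.h>

    static double
    bloom_fpr(double m_bits, double k_hashes, double n_items)
    {
        /* standard approximation of the Bloom filter false-positive rate */
        return pow(1.0 - exp(-k_hashes * n_items / m_bits), k_hashes);
    }

    int
    main(void)
    {
        double      m = 8000.0;     /* 1KB filter, taken as 1000 bytes * 8 bits */

        printf("k=2, n=100:  %.7f\n", bloom_fpr(m, 2, 100));    /* ~0.0006096 */
        printf("k=2, n=1000: %.7f\n", bloom_fpr(m, 2, 1000));   /* ~0.0489291 */
        return 0;
    }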
Re: Injection point locking
Heikki Linnakangas writes: > Note that until we actually add an injection point to a function that > runs in the postmaster, there's no risk. If we're uneasy about that, we > could add an assertion to InjectionPointRun() to prevent it from running > in the postmaster, so that we don't cross that line inadvertently. As long as we consider injection points to be a debug/test feature only, I think it's a net positive that one can be set in the postmaster. I'd be considerably more uncomfortable if somebody wanted to do that in production, but maybe it'd be fine even then. regards, tom lane
Re: 回复:Re: 回复:Re: speed up pg_upgrade with large number of tables
On Mon, Jul 08, 2024 at 03:22:36PM +0800, 杨伯宇(长堂) wrote: > Besides, https://commitfest.postgresql.org/48/4995/ seems insufficient to > this situation. Some time-consuming functions like check_for_data_types_usage > are not yet able to run in parallel. But these patches could be a great > starting point for a more efficient parallelism implementation. Maybe we can > do it later. I actually just wrote up the first version of the patch for parallelizing the data type checks over the weekend. I'll post it shortly. -- nathan
Re: array_in sub function ReadArrayDimensions error message
jian he writes: > while reviewing the json query doc, > I found out the following error message was not quite right. > select '[1,2]'::int[]; > ERROR: malformed array literal: "[1,2]" > LINE 1: select '[1,2]'::int[]; >^ > DETAIL: Missing "]" after array dimensions. > should it be: > "Missing delimiter ":" while specifying array dimensions." That's presuming quite a lot about the nature of the error. All the code knows is that what follows the "1" should be either ":" or "]", and when it sees "," instead it throws this error. I agree the existing message isn't great, but trying to be more specific could confuse people even more if the more-specific message doesn't apply either. One possibility could be

    if (*p != ']')
        ereturn(escontext, false,
                (errcode(ERRCODE_INVALID_TEXT_REPRESENTATION),
                 errmsg("malformed array literal: \"%s\"", origStr),
+                (strchr(p, ']') != NULL) ?
+                errdetail("Array dimensions have invalid syntax.") :
                 errdetail("Missing \"%s\" after array dimensions.", "]")));

that is, only say "Missing "]"" if there's no ']' anywhere, and otherwise just say the dimensions are wrong. This could be fooled by a ']' that's part of some string in the data, but even then the errdetail isn't wrong. Or we could just say "Array dimensions have invalid syntax." unconditionally. regards, tom lane
Re: 回复: An implementation of multi-key sort
On 7/4/24 14:45, Yao Wang wrote: > Hi John, > > Thanks for your kind message. I talked to Heikki before getting Tomas's > response, and he said "no promise but I will take a look". That's why I > added his email. I have updated the CF entry and added Tomas as reviewer. > > Hi Tomas, > > Again, I'd say a big thank to you. The report and script are really, really > helpful. And your ideas are very valuable. > > Firstly, the expectation of mksort performance: > > 1. When mksort works well, it should be faster than qsort because it saves > the cost of comparing duplicated values every time. > 2. When all values are distinct at a particular column, the comparison > will finish immediately, and mksort will actually fall back to qsort. For > the case, mksort should be equal or a bit slower than qsort because it need > to maintain more complex state. > > Generally, the benefit of mksort is mainly from duplicated values and sort > keys: the more duplicated values and sort keys are, the bigger benefit it > gets. > > Analysis on the report in your previous mail > > > 1. It seems the script uses $count to specify the duplicated values: > > number of repetitions for each value (ndistinct = nrows/count) > > However, it is not always correct. For type text, the script generates > values like this: > > expr="md5(((i / $count) + random())::text)" > > But md5() generates totally random values regardless of $count. Some cases > of timestamptz have the same problem. > > For all distinct values, the sort will finish at first depth and fall to > qsort actually. > You're right, thanks for noticing / fixing this. > 2. Even for the types with correct duplicated setting, the duplicated ratio > is very small: e.g. say $nrows = 1 and $count = 100, only 1% duplicated > rows can go to depth 2, and only 0.01% of them can go to depth 3. So it still > works on nearly all distinct values. > True, but that's why the scripts test with much larger data sets too, with more comparisons needing to look at other columns. It's be possible to construct data sets that are likely to benefit more from mksort - I'm not against doing that, but then there's the question of what data sets are more representative of what users actually do. I'd say a random data set like the ones I used are fairly common - it's fine to not improve them, but we should not regress them. > 3. Qsort of PG17 uses kind of specialization for tuple comparator, i.e. it > uses specialized functions for different types, e.g. qsort_tuple_unsigned() > for unsigned int. The specialized comparators avoid all type related checks > and are much faster than regular comparator. That is why we saw 200% or more > regression for the cases. > OK, I'm not familiar with this code enough to have an opinion. > > Code optimizations I did for mk qsort > - > > 1. Adapted specialization for tuple comparator. > 2. Use kind of "hybrid" sort: when we actually adapt bubble sort due to > limited sort items, use bubble sort to check datums since specified depth. > 3. Other other optimizations such as pre-ordered check. > > > Analysis on the new report > -- > > I also did some modifications to your script about the issues of data types, > plus an output about distinct value count/distinct ratio, and an indicator > for improvement/regression. I attached the new script and a report on a > data set with 100,000 rows and 2, 5, 8 columns. > OK, but I think a report for a single data set size is not sufficient to evaluate a patch like this, it can easily miss various caching effects etc. 
The results I shared a couple minutes ago are from 1000 to 10M rows, and it's much more complete view. > 1. Generally, the result match the expectation: "When mksort works well, it > should be faster than qsort; when mksort falls to qsort, it should be equal > or a bit slower than qsort." The challenge is how to know in advance if mksort is likely to work well. > 2. For all values of "sequential" (except text type), mksort is a bit slower > than qsort because no actual sort is performed due to the "pre-ordered" > check. OK > 3. For int and bigint type, mksort became faster and faster when > there were more and more duplicated values and sort keys. Improvement of > the best cases is about 58% (line 333) and 57% (line 711). I find it hard to interpret the text-only report, but I suppose these are essentially the "green" patches in the PDF report I attached to my earlier message. And indeed, there are nice improvements, but only with cases with very many duplicates, and the price for that is 10-20% regressions in the other cases. That does not seem like a great trade off to me. > 4. For timestamptz type, mksort is a bit slower than qsort because the > distinct ratio is always 1 for almost all cases. I think more benefit is > available by increasing the duplicated values. Yeah, this was a b
Re: optimizing pg_upgrade's once-in-each-database steps
As I mentioned elsewhere [0], here's a first attempt at parallelizing the data type checks. I was worried that I might have to refactor Daniel's work in commit 347758b quite significantly, but I was able to avoid that by using a set of generic callbacks and providing each task step an index to the data_types_usage_checks array. [0] https://postgr.es/m/Zov5kHbEyDMuHJI_%40nathan -- nathan >From 0d3353c318ca4dcd582d67ca25c47b9e3cf24921 Mon Sep 17 00:00:00 2001 From: Nathan Bossart Date: Fri, 28 Jun 2024 11:02:44 -0500 Subject: [PATCH v3 1/7] introduce framework for parallelizing pg_upgrade tasks --- src/bin/pg_upgrade/Makefile | 1 + src/bin/pg_upgrade/async.c | 323 +++ src/bin/pg_upgrade/meson.build | 1 + src/bin/pg_upgrade/pg_upgrade.h | 16 ++ src/tools/pgindent/typedefs.list | 4 + 5 files changed, 345 insertions(+) create mode 100644 src/bin/pg_upgrade/async.c diff --git a/src/bin/pg_upgrade/Makefile b/src/bin/pg_upgrade/Makefile index bde91e2beb..3bc4f5d740 100644 --- a/src/bin/pg_upgrade/Makefile +++ b/src/bin/pg_upgrade/Makefile @@ -12,6 +12,7 @@ include $(top_builddir)/src/Makefile.global OBJS = \ $(WIN32RES) \ + async.o \ check.o \ controldata.o \ dump.o \ diff --git a/src/bin/pg_upgrade/async.c b/src/bin/pg_upgrade/async.c new file mode 100644 index 00..7df1e7712d --- /dev/null +++ b/src/bin/pg_upgrade/async.c @@ -0,0 +1,323 @@ +/* + * async.c + * + * parallelization via libpq's async APIs + * + * Copyright (c) 2024, PostgreSQL Global Development Group + * src/bin/pg_upgrade/async.c + */ + +#include "postgres_fe.h" + +#include "common/connect.h" +#include "fe_utils/string_utils.h" +#include "pg_upgrade.h" + +static int dbs_complete; +static int dbs_processing; + +typedef struct AsyncTaskCallbacks +{ + AsyncTaskGetQueryCB query_cb; + AsyncTaskProcessCB process_cb; + boolfree_result; + void *arg; +} AsyncTaskCallbacks; + +typedef struct AsyncTask +{ + AsyncTaskCallbacks *cbs; + int num_cb_sets; +} AsyncTask; + +typedef enum +{ + FREE, + CONNECTING, + SETTING_SEARCH_PATH, + RUNNING_QUERY, +} AsyncSlotState; + +typedef struct +{ + AsyncSlotState state; + int db; + int query; + PGconn *conn; +} AsyncSlot; + +AsyncTask * +async_task_create(void) +{ + return pg_malloc0(sizeof(AsyncTask)); +} + +void +async_task_free(AsyncTask *task) +{ + if (task->cbs) + pg_free(task->cbs); + + pg_free(task); +} + +void +async_task_add_step(AsyncTask *task, + AsyncTaskGetQueryCB query_cb, + AsyncTaskProcessCB process_cb, bool free_result, + void *arg) +{ + AsyncTaskCallbacks *new_cbs; + + task->cbs = pg_realloc(task->cbs, + ++task->num_cb_sets * sizeof(AsyncTaskCallbacks)); + + new_cbs = &task->cbs[task->num_cb_sets - 1]; + new_cbs->query_cb = query_cb; + new_cbs->process_cb = process_cb; + new_cbs->free_result = free_result; + new_cbs->arg = arg; +} + +static void +conn_failure(PGconn *conn) +{ + pg_log(PG_REPORT, "%s", PQerrorMessage(conn)); + printf(_("Failure, exiting\n")); + exit(1); +} + +static void +start_conn(const ClusterInfo *cluster, AsyncSlot *slot) +{ + PQExpBufferData conn_opts; + + /* Build connection string with proper quoting */ + initPQExpBuffer(&conn_opts); + appendPQExpBufferStr(&conn_opts, "dbname="); + appendConnStrVal(&conn_opts, cluster->dbarr.dbs[slot->db].db_name); + appendPQExpBufferStr(&conn_opts, " user="); + appendConnStrVal(&conn_opts, os_info.user); + appendPQExpBuffer(&conn_opts, " port=%d", cluster->port); + if (cluster->sockdir) + { + appendPQExpBufferStr(&conn_opts, " host="); + appendConnStrVal(&conn_opts, cluster->sockdir); + } + + slot->conn = 
PQconnectStart(conn_opts.data); + termPQExpBuffer(&conn_opts); + + if (!slot->conn) + conn_failure(slot->conn); +} + +static void +dispatch_query(const ClusterInfo *cluster, AsyncSlot *slot, + const AsyncTask *task) +{ + AsyncTaskCallbacks *cbs = &task->cbs[slot->query]; + AsyncTaskGetQueryCB get_query = cbs->query_cb; + DbInfo *dbinfo = &cluster->dbarr.dbs[slot->db]; + char *query = (*get_query) (dbinfo, cbs->arg); + + if (!PQsendQuery(slot->conn, query)) + conn_failure(slot->conn); + + pg_free(query); +} + +static PGresult * +get_last_result(PGconn *conn) +{ + PGresult *tmp; + PGresult *res = NULL; + + while ((tmp = PQgetResult(conn)) != NULL) + { + PQclear(res);
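As a rough illustration of how the task API introduced above might be used by a caller: the callback names, the per-check state, and the exact signature of the process callback are assumptions made for this sketch (that part of the series is not shown here), and it presumes the pg_upgrade.h declarations from the patch plus libpq-fe.h.

    /* Hypothetical callbacks; the query callback signature follows
     * dispatch_query() above, the process callback signature is a guess. */
    static char *
    data_type_check_query_cb(DbInfo *dbinfo, void *arg)
    {
        /* build and return the per-database SQL to send (placeholder here) */
        return pg_strdup("SELECT 1;");
    }

    static void
    data_type_check_process_cb(DbInfo *dbinfo, PGresult *res, void *arg)
    {
        /* inspect the query result for this database */
    }

    static void
    run_data_type_checks(void)
    {
        AsyncTask  *task = async_task_create();
        int         check_state = 0;    /* whatever per-check state is needed */

        async_task_add_step(task,
                            data_type_check_query_cb,
                            data_type_check_process_cb,
                            true,   /* let the framework free each PGresult */
                            &check_state);

        /* the framework then runs the registered step against every database
         * over a pool of non-blocking libpq connections */

        async_task_free(task);
    }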
Re: 回复: An implementation of multi-key sort
On 7/7/24 08:32, Konstantin Knizhnik wrote: > > On 04/07/2024 3:45 pm, Yao Wang wrote: >> Generally, the benefit of mksort is mainly from duplicated values and >> sort >> keys: the more duplicated values and sort keys are, the bigger benefit it >> gets. > ... >> 1. Use distinct stats info of table to enable mksort >> >> It's kind of heuristics: in optimizer, check >> Form_pg_statistic->stadistinct >> of a table via pg_statistics. Enable mksort only when it is less than a >> threshold. >> >> The hacked code works, which need to modify a couple of interfaces of >> optimizer. In addition, a complete solution should consider types and >> distinct values of all columns, which might be too complex, and the >> benefit >> seems not so big. > > > If mksort really provides advantage only when there are a lot of > duplicates (for prefix keys?) and of small fraction of duplicates there > is even some (small) regression > then IMHO taking in account in planner information about estimated > number of distinct values seems to be really important. What was a > problem with accessing this statistics and why it requires modification > of optimizer interfaces? There is `get_variable_numdistinct` function > which is defined and used only in selfuncs.c > Yeah, I've been wondering about that too. But I'm also a bit unsure if using this known-unreliable statistics (especially with filters and multiple columns) would actually fix the regressions. > Information about values distribution seems to be quite useful for > choosing optimal sort algorithm. Not only for multi-key sort > optimization. For example if we know min.max value of sort key and it is > small, we can use O(N) algorithm for sorting. Also it can help to > estimate when TOP-N search is preferable. > This assumes the information is accurate / reliable, and I'm far from sure about that. > Right now Posgres creates special path for incremental sort. I am not > sure if we also need to be separate path for mk-sort. > But IMHO if we need to change some optimizer interfaces to be able to > take in account statistic and choose preferred sort algorithm at > planning time, then it should be done. > If mksort can increase sort more than two times (for large number of > duplicates), it will be nice to take it in account when choosing optimal > plan. > I did commit the incremental sort patch, and TBH I'm not convinced I'd do that again. It's a great optimization when it works (and it seems to work in plenty of cases), but we've also had a number of reports about significant regressions, where the incremental sort costing is quite off. Granted, it's often about cases where we already had issues and incremental sort just "exacerbates" that (say, with LIMIT queries), but that's kinda the point I'm trying to make - stats are inherently incomplete / simplified, and some plans are more sensitive to that. Which is why I'm wondering if we might do the decision based on some information collected at runtime. > Also in this case we do not need extra GUC for explicit enabling of > mksort. There are too many parameters for optimizer and adding one more > will make tuning more complex. So I prefer that decision is take buy > optimizer itself based on the available information, especially if > criteria seems to be obvious. > The GUC is very useful for testing, so let's keep it for now. regards -- Tomas Vondra EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
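For readers following the thread, here is a small, self-contained sketch of the multi-key sort idea under discussion (purely illustrative, with invented names, not the proposed patch): sort by the first key, then re-sort only the runs of duplicate values by the next key, so later keys are never compared for rows that earlier keys already distinguish. With mostly-distinct leading keys the recursion does almost nothing, which is why the open question above is how to predict duplication cheaply.

    #include <stdio.h>

    #define NCOLS 2
    #define NROWS 8

    typedef struct Row { int col[NCOLS]; } Row;

    /* simple insertion sort of rows[lo..hi) by a single column */
    static void
    sort_by_column(Row *rows, int lo, int hi, int c)
    {
        for (int i = lo + 1; i < hi; i++)
        {
            Row     key = rows[i];
            int     j = i - 1;

            while (j >= lo && rows[j].col[c] > key.col[c])
            {
                rows[j + 1] = rows[j];
                j--;
            }
            rows[j + 1] = key;
        }
    }

    static void
    mk_sort(Row *rows, int lo, int hi, int depth)
    {
        if (hi - lo < 2 || depth >= NCOLS)
            return;

        sort_by_column(rows, lo, hi, depth);

        /* recurse only into runs of equal values: later keys are never
         * compared for rows already distinguished by this key */
        int     run_start = lo;

        for (int i = lo + 1; i <= hi; i++)
        {
            if (i == hi || rows[i].col[depth] != rows[run_start].col[depth])
            {
                mk_sort(rows, run_start, i, depth + 1);
                run_start = i;
            }
        }
    }

    int
    main(void)
    {
        Row     rows[NROWS] = {{{2, 4}}, {{1, 9}}, {{2, 1}}, {{1, 3}},
                               {{2, 2}}, {{1, 1}}, {{2, 3}}, {{1, 2}}};

        mk_sort(rows, 0, NROWS, 0);
        for (int i = 0; i < NROWS; i++)
            printf("(%d, %d)\n", rows[i].col[0], rows[i].col[1]);
        return 0;
    }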
Re: ❓ JSON Path Dot Precedence
Hi, following up on some old threads. > On Apr 10, 2024, at 16:44, David E. Wheeler wrote: > > That makes sense, thanks. It’s just a little odd to me that the resulting > path isn’t a query at all. To Erik’s point: what path can `'0x2.p10` even > select? I’m wondering whether the jsonpath parser should be updated to reject cases like this. I think it will always return no results. AFAICT, there’s no way to navigate to an object identifier immediately after a number: david=# select '0x2.p10'::jsonpath; jsonpath --- (2)."p10" (1 row) david=# select jsonb_path_query(target => '[0, 1, {"p10": true}]', path => '0x2.p10'); jsonb_path_query -- (0 rows) david=# select jsonb_path_query(target => '{"0x2": {"p10": true}}', path => '0x2.p10'); jsonb_path_query -- (0 rows) It’s just inherently meaningless. BTW, it’s not limited to hex numbers: david=# select '(2).p10'::jsonpath; jsonpath --- (2)."p10" OTOH, maybe that’s a corner case we can live with. Best, David
Re: Parallel CREATE INDEX for GIN indexes
On Mon, 8 Jul 2024, 13:38 Tomas Vondra, wrote: > > On 7/8/24 11:45, Matthias van de Meent wrote: > > As to the GIN TS code itself; yes it's more complicated, mainly > > because it uses several optimizations to reduce unnecessary > > allocations and (de)serializations of GinTuples, and I'm aware of even > > more such optimizations that can be added at some point. > > > > As an example: I suspect the use of GinBuffer in writetup_index_gin to > > be a significant resource drain in my patch because it lacks > > "freezing" in the tuplesort buffer. When no duplicate key values are > > encountered, the code is nearly optimal (except for a full tuple copy > > to get the data into the GinBuffer's memory context), but if more than > > one GinTuple has the same key in the merge phase we deserialize both > > tuple's posting lists and merge the two. I suspect that merge to be > > more expensive than operating on the compressed posting lists of the > > GinTuples themselves, so that's something I think could be improved. I > > suspect/guess it could save another 10% in select cases, and will > > definitely reduce the memory footprint of the buffer. > > Another thing that can be optimized is the current approach of > > inserting data into the index: I think it's kind of wasteful to > > decompress and later re-compress the posting lists once we start > > storing the tuples on disk. > > > > I need to experiment with this a bit more, to better understand the > behavior and pros/cons. But one thing that's not clear to me is why > would this be better than simply increasing the amount of memory for the > initial BuildAccumulator buffer ... > > Wouldn't that have pretty much the same effect? I don't think so: The BuildAccumulator buffer will probably never be guaranteed to have space for all index entries, though it does use memory more efficiently than Tuplesort. Therefore, it will likely have to flush keys multiple times into sorted runs, with likely duplicate keys existing in the tuplesort. My patch 0008 targets the reduction of IO and CPU once BuildAccumulator has exceeded its memory limits. It reduces the IO and computational requirement of Tuplesort's sorted-run merging by merging the tuples in those sorted runs in that merge process, reducing the number of tuples, bytes stored, and number of compares required in later operations. It enables some BuildAccumulator-like behaviour inside the tuplesort, without needing to assign huge amounts of memory to the BuildAccumulator by allowing efficient spilling to disk of the incomplete index data. And, during merging, it can work with O(n_merge_tapes * tupsz) of memory, rather than O(n_distinct_keys * tupsz). This doesn't make BuildAccumulator totally useless, but it does reduce the relative impact of assigning more memory. One significant difference between the modified Tuplesort and BuildAccumulator is that the modified Tuplesort only merges the entries once they're getting written, i.e. flushed from the in-memory structure; while BuildAccumulator merges entries as they're being added to the in-memory structure. Note that this difference causes BuildAccumulator to use memory more efficiently during in-memory workloads (it doesn't duplicate keys in memory), but as BuildAccumulator doesn't have spilling it doesn't handle full indexes' worth of data (it does duplciate keys on disk). I hope this clarifies things a bit. 
I'd be thrilled if we'd be able to put BuildAccumulator-like behaviour into the in-memory portion of Tuplesort, but that'd require a significantly deeper understanding of the Tuplesort internals than what I currently have, especially in the area of its memory management. Kind regards Matthias van de Meent Neon (https://neon.tech)
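To make the duplicate-key merging described above a bit more concrete, here is a simplified, self-contained sketch (not the GinTuple/GinBuffer code itself; the Entry type and function names are invented, posting lists are reduced to plain ints, and error handling and freeing are omitted): when a key-sorted stream is consumed, consecutive entries with equal keys are combined by merging their sorted posting lists, so later passes see fewer, larger entries.

    #include <stdio.h>
    #include <stdlib.h>

    typedef struct Entry
    {
        int     key;
        int    *items;      /* sorted item pointers, simplified to ints */
        int     nitems;
    } Entry;

    /* merge two sorted int arrays into a freshly allocated sorted array */
    static int *
    merge_items(const int *a, int na, const int *b, int nb, int *nout)
    {
        int    *out = malloc(sizeof(int) * (na + nb));
        int     i = 0, j = 0, k = 0;

        while (i < na && j < nb)
            out[k++] = (a[i] <= b[j]) ? a[i++] : b[j++];
        while (i < na)
            out[k++] = a[i++];
        while (j < nb)
            out[k++] = b[j++];
        *nout = k;
        return out;
    }

    /* collapse a key-sorted stream in place; returns the new entry count */
    static int
    collapse_duplicates(Entry *in, int n)
    {
        int     nout = 0;

        for (int i = 0; i < n; i++)
        {
            if (nout > 0 && in[nout - 1].key == in[i].key)
            {
                int     merged_n;
                int    *merged = merge_items(in[nout - 1].items, in[nout - 1].nitems,
                                             in[i].items, in[i].nitems, &merged_n);

                in[nout - 1].items = merged;
                in[nout - 1].nitems = merged_n;
            }
            else
                in[nout++] = in[i];
        }
        return nout;
    }

    int
    main(void)
    {
        int     a[] = {1, 5}, b[] = {2, 3}, c[] = {4};
        Entry   stream[] = {{10, a, 2}, {10, b, 2}, {20, c, 1}};
        int     n = collapse_duplicates(stream, 3);

        for (int i = 0; i < n; i++)
        {
            printf("key %d:", stream[i].key);
            for (int j = 0; j < stream[i].nitems; j++)
                printf(" %d", stream[i].items[j]);
            printf("\n");
        }
        return 0;
    }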
Re: Commitfest manager for July 2024
On 7/3/24 12:51, Andrey M. Borodin wrote: On 3 Jul 2024, at 01:08, Corey Huinker wrote: I'll give it a shot. Great, thank you! Do you have extended access to CF? Like activity log and mass-mail functions? If no I think someone from PG_INFRA can grant you necessary access. I can do that, although given that Corey and I are colleagues, it might be better if someone else on pginfra did. Or at least if a few other hackers tell me to "just do it"... -- Joe Conway PostgreSQL Contributors Team RDS Open Source Databases Amazon Web Services: https://aws.amazon.com
Re: RFC: Additional Directory for Extensions
On Jun 25, 2024, at 18:31, David E. Wheeler wrote: > For those who prefer a GitHub patch review experience, see this PR: > > https://github.com/theory/postgres/pull/3/files Rebased and restored PGC_SUSET in the attached v5 patch, plus noted the required privileges in the docs. Best, David v5-0001-Add-extension_destdir-GUC.patch Description: Binary data
Re: ❓ JSON Path Dot Precedence
On Mon, Jul 8, 2024 at 8:27 AM David E. Wheeler wrote: > Hi, following up on some old threads. > > > On Apr 10, 2024, at 16:44, David E. Wheeler > wrote: > > > > That makes sense, thanks. It’s just a little odd to me that the > resulting path isn’t a query at all. To Erik’s point: what path can > `'0x2.p10` even select? > > I’m wondering whether the jsonpath parser should be updated to reject > cases like this. I think it will always return no results. AFAICT, there’s > no way to navigate to an object identifier immediately after a number: > > If we go down this path wouldn't the correct framing be: do not allow accessors after scalars ? The same argument applies to false/"john" and other scalar types since by definition none of them have subcomponents to be accessed. That said, the parser has a lax mode which somewhat implies it doesn't expect the jsonpath type to perform much in the way of validation of the semantic correctness of the encoded path expression. I like the idea of a smarter expression-holding type and would even wish to have had this on day one. Retrofitting is less appealing. We document a similarity with regular expressions here where we, for better and worse, have lived without a regexppath data type forever and leave it to the executor to tell the user their pattern is invalid. Leaning on that precedence here makes accepting the status quo more reasonable. Though strict/lax modes and, I think, variables, motivates me to put my vote toward the "do more validation" group. Does the standard even have a separate type here or is that our implementation detail invention? David J.
Re: Patch bug: Fix jsonpath .* on Arrays
On Jun 27, 2024, at 04:17, Michael Paquier wrote: > The tests of jsonb_jsonpath.sql include a lot of patterns for @? and > jsonb_path_query with the lax and strict modes, so shouldn't these > additions be grouped closer to the existing tests rather than added at > the end of the file? I’ve moved them closer to other tests for unwrapping behavior in the attached updated and rebased v3 patch. Best, David v3-0001-Add-tests-for-jsonpath-.-on-arrays.patch Description: Binary data
Re: ❓ JSON Path Dot Precedence
On Jul 8, 2024, at 12:05, David G. Johnston wrote: > If we go down this path wouldn't the correct framing be: do not allow > accessors after scalars ? The same argument applies to false/"john" and > other scalar types since by definition none of them have subcomponents to be > accessed. Yes, excellent point. > That said, the parser has a lax mode which somewhat implies it doesn't expect > the jsonpath type to perform much in the way of validation of the semantic > correctness of the encoded path expression. My understanding is that lax mode means it ignores where the JSON doesn’t abide by expectations of the path expression, not that the path parsing is lax. > I like the idea of a smarter expression-holding type and would even wish to > have had this on day one. Retrofitting is less appealing. We document a > similarity with regular expressions here where we, for better and worse, have > lived without a regexppath data type forever and leave it to the executor to > tell the user their pattern is invalid. Leaning on that precedence here > makes accepting the status quo more reasonable. Though strict/lax modes and, > I think, variables, motivates me to put my vote toward the "do more > validation" group. This feels different from a documented difference in behavior as an implementation choice, like path regex vs. Spencer. In this case, the expression is technically meaningless, but there’s never so much as an error thrown. > Does the standard even have a separate type here or is that our > implementation detail invention? Sorry, separate type for what? Best, David
Re: ❓ JSON Path Dot Precedence
On Mon, Jul 8, 2024 at 9:12 AM David E. Wheeler wrote: > On Jul 8, 2024, at 12:05, David G. Johnston > wrote: > > > Does the standard even have a separate type here or is that our > implementation detail invention? > > Sorry, separate type for what? > > We created a data type named: jsonpath. Does the standard actually have that data type and defined parsing behavior or does it just have functions where one of the inputs is text whose contents are a path expression? David J.
Re: Add GiST support for mixed-width integer operators
On 7/6/24 05:04, Andrey M. Borodin wrote:>> On 5 Jul 2024, at 23:46, Paul Jungwirth wrote: this commit adds support for all combinations of int2/int4/int8 for all five btree operators (=/>). Looks like a nice feature to have. Would it make sense to do something similar to float8? Or, perhaps, some other types from btree_gist? Here is another patch adding float4/8 and also date/timestamp/timestamptz, in the same combinations as btree. No other types seem like they deserve this treatment. For example btree doesn't mix oids with ints. Also, are we sure such tests will be stable? You're right, it was questionable. We hadn't analyzed the table, and after doing that the plan changes from a bitmap scan to an index-only scan. That makes more sense, and I doubt it will change now that it's based on statistics. Yours, -- Paul ~{:-) p...@illuminatedcomputing.comFrom c3b5f0c135b1d430f9ebf1367644f70726f5bb41 Mon Sep 17 00:00:00 2001 From: "Paul A. Jungwirth" Date: Fri, 5 Jul 2024 11:16:09 -0700 Subject: [PATCH v2] Use GiST index with mixed integer/float/timestamp widths A GiST index on a bigint will not be used when compared to an int literal, unless the user casts the literal to a bigint. This is surprising and causes invisible slowness. To fix that we can add pg_amop entries for mixed-width arguments. We do that for all int2/int4/int8 combinations, and also for float4/float8 and date/timestamp/timestamptz (not that timestamptz is wider than timestamp, but we can follow btree in allowing mixed operators for them too). With btree, we have one opfamily for all integer types, and it supports =(int2,int8), etc. With GiST we have a separate opfamily for each width, so it's less obvious where to put the mixed-width operators. But it makes sense for wider opfamilies to include support for smaller types, so I added =(int2,int8) and =(int4,int8) to gist_int8_ops, and =(int2,int4) to gist_int4_ops. Also =(float4,float8) to gist_float8_ops, =(date,timestamptz) and =(timestamp,timestamptz to gist_timestamptz_ops, and =(date,timestamp) to gist_timestamp_ops. And as long as we're doing this, we might as well support all five btree operators (<, <=, =, >=, >). 
--- contrib/btree_gist/Makefile | 3 +- contrib/btree_gist/btree_gist--1.7--1.8.sql | 101 contrib/btree_gist/btree_gist.control | 2 +- contrib/btree_gist/expected/float8.out | 12 +++ contrib/btree_gist/expected/int4.out| 12 +++ contrib/btree_gist/expected/int8.out| 21 contrib/btree_gist/expected/timestamp.out | 12 +++ contrib/btree_gist/expected/timestamptz.out | 22 + contrib/btree_gist/meson.build | 1 + contrib/btree_gist/sql/float8.sql | 8 ++ contrib/btree_gist/sql/int4.sql | 7 ++ contrib/btree_gist/sql/int8.sql | 10 ++ contrib/btree_gist/sql/timestamp.sql| 7 ++ contrib/btree_gist/sql/timestamptz.sql | 11 +++ 14 files changed, 227 insertions(+), 2 deletions(-) create mode 100644 contrib/btree_gist/btree_gist--1.7--1.8.sql diff --git a/contrib/btree_gist/Makefile b/contrib/btree_gist/Makefile index 073dcc745c4..965f0559d64 100644 --- a/contrib/btree_gist/Makefile +++ b/contrib/btree_gist/Makefile @@ -33,7 +33,8 @@ EXTENSION = btree_gist DATA = btree_gist--1.0--1.1.sql \ btree_gist--1.1--1.2.sql btree_gist--1.2.sql btree_gist--1.2--1.3.sql \ btree_gist--1.3--1.4.sql btree_gist--1.4--1.5.sql \ - btree_gist--1.5--1.6.sql btree_gist--1.6--1.7.sql + btree_gist--1.5--1.6.sql btree_gist--1.6--1.7.sql \ + btree_gist--1.7--1.8.sql PGFILEDESC = "btree_gist - B-tree equivalent GiST operator classes" REGRESS = init int2 int4 int8 float4 float8 cash oid timestamp timestamptz \ diff --git a/contrib/btree_gist/btree_gist--1.7--1.8.sql b/contrib/btree_gist/btree_gist--1.7--1.8.sql new file mode 100644 index 000..221df3cd6f8 --- /dev/null +++ b/contrib/btree_gist/btree_gist--1.7--1.8.sql @@ -0,0 +1,101 @@ +/* contrib/btree_gist/btree_gist--1.6--1.7.sql */ + +-- complain if script is sourced in psql, rather than via CREATE EXTENSION +\echo Use "ALTER EXTENSION btree_gist UPDATE TO '1.7'" to load this file. \quit + +-- Add mixed integer operators. +-- This lets Postgres use the index without casting literals to bigints, etc. +-- Whereas btree has one integer_ops opfamily, +-- GiST has gist_int2_ops, gist_int4_ops, and gist_int8_ops. +-- We add the mixed operators to whichever opfamily matches the larger type, +-- sort of like "promoting". +ALTER OPERATOR FAMILY gist_int8_ops USING gist ADD + OPERATOR 1 < (int8, int4), + OPERATOR 2 <= (int8, int4), + OPERATOR 3 = (int8, int4), + OPERATOR 4 >= (int8, int4), + OPERATOR 5 > (int8, int4), + + OPERATOR 1 < (int4, int8), + OPERATOR 2 <= (int4, int8), + OPERATOR 3 = (int4, int8), + OPERATOR 4 >= (int4, int8), + OPERATOR 5 > (int4, int8), + + OPERATOR 1 < (int8, int2), + OPERATOR 2 <
Re: Detoasting optionally to make Explain-Analyze less misleading
> > So thanks again! and this will really help a lot of people. I'd like to echo this thanks to you all. While looking to add support for SERIALIZE in an explain visualisation tool I work on, I realised there isn't yet an equivalent auto_explain parameter for SERIALIZE. I'm not sure if this is a deliberate omission (perhaps for a similar reason planning time is not included in auto_explain?), but I didn't see it mentioned above, so I thought best to ask in case not. Thanks again, Michael
Re: ❓ JSON Path Dot Precedence
On Jul 8, 2024, at 12:17, David G. Johnston wrote: > We created a data type named: jsonpath. Does the standard actually have that > data type and defined parsing behavior or does it just have functions where > one of the inputs is text whose contents are a path expression? Ah, got it. D
Re: Detoasting optionally to make Explain-Analyze less misleading
Michael Christofides writes: > While looking to add support for SERIALIZE in an explain visualisation tool > I work on, I realised there isn't yet an equivalent auto_explain parameter > for SERIALIZE. I'm not sure if this is a deliberate omission (perhaps for a > similar reason planning time is not included in auto_explain?), but I > didn't see it mentioned above, so I thought best to ask in case not. I'm not sure there's a need for it. When a query runs under auto_explain, the output values will be sent to the client, so those cycles should be accounted for anyway, no? (Perhaps the auto_explain documentation should mention this?) regards, tom lane
Re: Windows: openssl & gssapi dislike each other
Hi, On 2024-06-13 00:12:51 +0900, Imran Zaheer wrote: > I removed the macro from the sslinfo.c as suggested by Andrew. Then I > was thinking maybe we can undo some other similar code. What precisely do you mean by that? Just getting rid of the "ordered include" of openssl headers in {fe,be}-secure-openssl.h? > I rearranged the headers to their previous position in > be-secure-openssl.c and in fe-secure-openssl.c. I was able to compile > with gssapi and openssl enabled. You can look into the original commits. [1, > 2] > Is it ok if we undo the changes from these commits? > > I am attaching two new patches. > One with macro guards removed from sslinfo.c. > Second patch will additionally rearrange headers for > be-secure-openssl.c and in fe-secure-openssl.c to their previous > position. One thing that concerns me with this is that there are other includes of gssapi/gssapi.h (e.g. in , which haven't been changed here. ISTM we ought to apply the same change to all of those, otherwise we're just waiting for the problem to re-appear. I wonder if we should add a src/include/libpq/pg-gssapi.h or such, which could wrap the entire ifdeferry for gss support. Something like

#ifdef ENABLE_GSS

#if defined(HAVE_GSSAPI_H)
#include <gssapi.h>
#include <gssapi_ext.h>
#else
#include <gssapi/gssapi.h>
#include <gssapi/gssapi_ext.h>
#endif

/*
 * On Windows, <wincrypt.h> includes a #define for X509_NAME, which breaks our
 * ability to use OpenSSL's version of that symbol if <wincrypt.h> is pulled
 * in after <openssl/ssl.h> ... and, at least on some builds, it is. We
 * can't reliably fix that by re-ordering #includes, because libpq/libpq-be.h
 * #includes <openssl/ssl.h>. Instead, just zap the #define again here.
 */
#ifdef X509_NAME
#undef X509_NAME
#endif

#endif							/* ENABLE_GSS */

Which'd allow the various places using gss (libpq-be.h, be-gssapi-common.h, libpq-int.h) to just include pg-gssapi.h and get all of the above without redundancy? Another thing that concerns me about this approach is that it seems to assume that the only source of such conflicting includes is gssapi. What if some other header pulls in wincrypt.h? But I can't really see a way out of that... Greetings, Andres Freund
Re: jsonpath: Inconsistency of timestamp_tz() Output
On Jul 2, 2024, at 10:53, David E. Wheeler wrote: > ``` patch > --- a/src/test/regress/expected/jsonb_jsonpath.out > +++ b/src/test/regress/expected/jsonb_jsonpath.out > @@ -2914,7 +2914,7 @@ HINT: Use *_tz() function for time zone support. > select jsonb_path_query_tz('"2023-08-15"', '$.timestamp_tz()'); -- should work > jsonb_path_query_tz > - > - "2023-08-15T07:00:00+00:00" > + "2023-08-15T00:00:00-07:00" > (1 row) > > select jsonb_path_query('"12:34:56"', '$.timestamp_tz()'); > @@ -3003,7 +3003,7 @@ HINT: Use *_tz() function for time zone support. > select jsonb_path_query_tz('"2023-08-15 12:34:56"', '$.timestamp_tz()'); -- > should work > jsonb_path_query_tz > - > - "2023-08-15T12:34:56+00:00" > + "2023-08-15T12:34:56+10:00" > (1 row) > > select jsonb_path_query('"10-03-2017 12:34"', '$.datetime("dd-mm- > HH24:MI")'); > ``` FWIW I fixed this issue in my jsonpath port, which I released over the weekend.[1] You can see what I think should be the proper output for these two examples in these Playground links, where the response will use your browser’s time zone: [2], [3]. Best, David [1]: https://justatheory.com/2024/07/go-sqljson-path/ [2]: https://theory.github.io/sqljson/playground/?p=%2524.timestamp_tz%28%29&j=%25222023-08-15%2522&a=&o=49 [3]: https://theory.github.io/sqljson/playground/?p=%2524.timestamp_tz%28%29&j=%25222023-08-15%252012%253A34%253A56%2522&a=&o=49
Should we work around msvc failing to compile tab-complete.c?
Hi, Compiling postgres on windows with tab-completion support fails either with "fatal error C1026: parser stack overflow, program too complex”. or (in recent versions) with "…/src/bin/psql/tab-complete.c(4023): fatal error C1001: Internal compiler error." I've reported this to the visual studio folks at [1] and supposedly a fix is waiting to be released. I don't know if that's just for the internal compiler error in 2022 or lifting the limitation. It's pretty easy to work around the error [2]. I wonder if we should just do that, then we don't have to insist on a very new msvc version being used and it'll work even if they just decide to fix the internal compiler error. Besides just breaking the if-else-if chain at some arbitrary point, we could alternatively make a more general efficiency improvement, and add a bit of nesting at a few places. E.g. instead of having ~33 checks for COMMENT being the first word, we could use "outer" else-if for COMMENT and check the sub-variants inside that branch: else if (HeadMatches("COMMENT")) { if (Matches("COMMENT")) COMPLETE_WITH("ON"); else if (Matches("COMMENT", "ON")) ... } if we do that to one or two common keywords (COMMENT and DROP seems easiest) we'd be out from that limitation, and would probably reduce the number of cycles for completions noticeably. OTOH, that'd be a more noisy change and it'd be less defensible to apply it to 17 - which IMO would be nice to do. Greetings, Andres Freund [1] https://developercommunity.visualstudio.com/t/tab-completec4023:-fatal-error-C1001:/10685868 [2] https://postgr.es/m/CACLU5mRRLAW2kca3k2gVDM8kow6wgvT_BCaeg37jz%3DKyj1afvw%40mail.gmail.com
Re: debugging what might be a perf regression in 17beta2
My results have too much variance so this is a false alarm. One day I might learn whether the noise is from HW, Postgres or my test method. I ended up trying 10 builds between 17beta1 and 17beta2, but even with that I don't have a clear signal. On Fri, Jul 5, 2024 at 8:48 PM David Rowley wrote: > On Sat, 6 Jul 2024 at 15:11, MARK CALLAGHAN wrote: > > On small servers I have at home I can reproduce the problem without > concurrent queries and 17beta2 is 5% to 10% slower there. > > > > The SQL statement for the scan microbenchmark is: > > SELECT * from %s WHERE LENGTH(c) < 0 > > Can you share the CREATE TABLE and script to populate so others can try? > > Also, have you tried with another compiler? It does not seem > impossible that the refactoring done in heapam.c or the read stream > code might have changed something to make the binary more sensitive to > caching effects in this area. One thing I often try when I can't > pinpoint the exact offending commit is to write a script to try the > first commit of each day for, say, 30 days to see if there is any > saw-toothing in performance numbers over that period. > > David > -- Mark Callaghan mdcal...@gmail.com
Re: debugging what might be a perf regression in 17beta2
A writeup for the benchmark results is here - https://smalldatum.blogspot.com/2024/07/postgres-17beta2-vs-sysbench-looking.html pg17beta2 and pg17beta1 look good so far On Mon, Jul 8, 2024 at 10:49 AM MARK CALLAGHAN wrote: > My results have too much variance so this is a false alarm. One day I > might learn whether the noise is from HW, Postgres or my test method. > I ended up trying 10 builds between 17beta1 and 17beta2, but even with > that I don't have a clear signal. > > On Fri, Jul 5, 2024 at 8:48 PM David Rowley wrote: > >> On Sat, 6 Jul 2024 at 15:11, MARK CALLAGHAN wrote: >> > On small servers I have at home I can reproduce the problem without >> concurrent queries and 17beta2 is 5% to 10% slower there. >> > >> > The SQL statement for the scan microbenchmark is: >> > SELECT * from %s WHERE LENGTH(c) < 0 >> >> Can you share the CREATE TABLE and script to populate so others can try? >> >> Also, have you tried with another compiler? It does not seem >> impossible that the refactoring done in heapam.c or the read stream >> code might have changed something to make the binary more sensitive to >> caching effects in this area. One thing I often try when I can't >> pinpoint the exact offending commit is to write a script to try the >> first commit of each day for, say, 30 days to see if there is any >> saw-toothing in performance numbers over that period. >> >> David >> > > > -- > Mark Callaghan > mdcal...@gmail.com > -- Mark Callaghan mdcal...@gmail.com
Re: Detoasting optionally to make Explain-Analyze less misleading
> I'm not sure there's a need for it. When a query runs under > auto_explain, the output values will be sent to the client, > so those cycles should be accounted for anyway, no? > Yes, great point, the total duration reported by auto_explain includes it. Explicit serialization stats might still be helpful for folks when it is the bottleneck, but less useful for sure (especially if nothing else causes big discrepancies between the duration reported by auto_explain and the "actual total time" of the root node). (Perhaps the auto_explain documentation should mention this?) > I'd value this. I notice the folks working on the other new explain parameter (memory) opted to add a comment to the auto_explain source code to say it wasn't supported. Thanks again, Michael
Re: Confine vacuum skip logic to lazy_scan_skip
On Sun, Jul 7, 2024 at 10:49 AM Noah Misch wrote: > > On Fri, Jun 28, 2024 at 05:36:25PM -0400, Melanie Plageman wrote: > > I've attached a WIP v11 streaming vacuum patch set here that is > > rebased over master (by Thomas), so that I could add a CF entry for > > it. It still has the problem with the extra WAL write and fsync calls > > investigated by Thomas above. Thomas has some work in progress doing > > streaming write-behind to alleviate the issues with the buffer access > > strategy and streaming reads. When he gets a version of that ready to > > share, he will start a new "Streaming Vacuum" thread. > > To avoid reviewing the wrong patch, I'm writing to verify the status here. > This is Needs Review in the commitfest. I think one of these two holds: > > 1. Needs Review is valid. > 2. It's actually Waiting on Author. You're commissioning a review of the >future-thread patch, not this one. > > If it's (1), given the WIP marking, what is the scope of the review you seek? > I'm guessing performance is out of scope; what else is in or out of scope? Ah, you're right. I moved it to "Waiting on Author" as we are waiting on Thomas' version which has a fix for the extra WAL write/sync behavior. Sorry for the "Needs Review" noise! - Melanie
Re: Should we work around msvc failing to compile tab-complete.c?
Andres Freund writes: > Compiling postgres on windows with tab-completion support fails either with > "fatal error C1026: parser stack overflow, program too complex”. > or (in recent versions) with > "…/src/bin/psql/tab-complete.c(4023): fatal error C1001: Internal compiler > error." > It's pretty easy to work around the error [2]. I wonder if we should just do > that, then we don't have to insist on a very new msvc version being used and > it'll work even if they just decide to fix the internal compiler error. I'm on board with doing something here, but wouldn't we want to back-patch at least a minimal fix to all supported branches? As for the specific thing to do, that long if-chain seems horrid from an efficiency standpoint as well as stressing compiler implementations. I realize that this pretty much only needs to run at human-reaction-time speed, but it still offends my inner nerd. I also wonder when we are going to hit problems with some earlier test unexpectedly pre-empting some later one. Could we perhaps have a table of first words of each interesting match, and do a lookup in that before dispatching to code segments that are individually similar to what's there now? regards, tom lane
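A tiny sketch of the lookup-table idea, with invented names rather than psql's real Matches()/HeadMatches() machinery: keep a table keyed by the first word and dispatch to a per-command routine, so each routine stays small and the top-level else-if chain disappears.

    #include <stdio.h>
    #include <string.h>

    typedef const char **(*CompleteFn) (const char **words, int nwords);

    static const char **
    complete_comment(const char **words, int nwords)
    {
        static const char *on[] = {"ON", NULL};

        return (nwords == 1) ? on : NULL;   /* "COMMENT" -> "ON", etc. */
    }

    static const char **
    complete_drop(const char **words, int nwords)
    {
        static const char *objs[] = {"TABLE", "INDEX", "VIEW", NULL};

        return (nwords == 1) ? objs : NULL;
    }

    static const struct
    {
        const char *first_word;
        CompleteFn  fn;
    }           dispatch[] = {
        {"COMMENT", complete_comment},
        {"DROP", complete_drop},
    };

    static const char **
    complete(const char **words, int nwords)
    {
        for (size_t i = 0; i < sizeof(dispatch) / sizeof(dispatch[0]); i++)
        {
            if (strcmp(words[0], dispatch[i].first_word) == 0)
                return dispatch[i].fn(words, nwords);
        }
        return NULL;                /* fall back to generic handling */
    }

    int
    main(void)
    {
        const char *words[] = {"COMMENT"};
        const char **matches = complete(words, 1);

        for (int i = 0; matches && matches[i]; i++)
            printf("%s\n", matches[i]);
        return 0;
    }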
Re: Vacuum ERRORs out considering freezing dead tuples from before OldestXmin
On Tue, Jul 2, 2024 at 7:07 PM Melanie Plageman wrote: > > On Mon, Jun 24, 2024 at 4:27 AM Heikki Linnakangas wrote: > > > > Would it be possible to make it robust so that we could always run it > > with "make check"? This seems like an important corner case to > > regression test. > > Okay, I've attached a new version of the patch and a new version of > the repro that may be fast and stable enough to commit. It is more > minimal than the previous version. I made the table as small as I > could to still trigger two rounds of index vacuuming. I tried to make > it as stable as possible. I also removed the cursor on the standby > that could trigger recovery conflicts. It would be super helpful if > someone could take a look at the test and point out any ways I could > make it even more likely to be stable. Attached v3 has one important additional component in the test -- I use pg_stat_progress_vacuum to confirm that we actually do more than one pass of index vacuuming. Otherwise, it would have been trivial for the test to incorrectly pass. I could still use another pair of eyes on the test (looking out for stability enhancing measures I could take). I also would be happy if someone looked at the commit message and/or comments to let me know if they make sense. I'll finish with versions of the patch and test targeted at v14-16 and propose those before committing this. - Melanie From b83a5e4ab84cb4b4de104fee05daf0bf88fe70ff Mon Sep 17 00:00:00 2001 From: Melanie Plageman Date: Thu, 20 Jun 2024 16:38:30 -0400 Subject: [PATCH v3 1/2] Test that vacuum removes tuples older than OldestXmin If vacuum fails to prune a tuple killed before OldestXmin, it will decide to freeze its xmax and later error out in pre-freeze checks. --- src/test/recovery/meson.build | 1 + .../recovery/t/043_vacuum_horizon_floor.pl| 236 ++ 2 files changed, 237 insertions(+) create mode 100644 src/test/recovery/t/043_vacuum_horizon_floor.pl diff --git a/src/test/recovery/meson.build b/src/test/recovery/meson.build index b1eb77b1ec1..1d55d6bf560 100644 --- a/src/test/recovery/meson.build +++ b/src/test/recovery/meson.build @@ -51,6 +51,7 @@ tests += { 't/040_standby_failover_slots_sync.pl', 't/041_checkpoint_at_promote.pl', 't/042_low_level_backup.pl', + 't/043_vacuum_horizon_floor.pl', ], }, } diff --git a/src/test/recovery/t/043_vacuum_horizon_floor.pl b/src/test/recovery/t/043_vacuum_horizon_floor.pl new file mode 100644 index 000..1b82d935be5 --- /dev/null +++ b/src/test/recovery/t/043_vacuum_horizon_floor.pl @@ -0,0 +1,236 @@ +use strict; +use warnings; +use PostgreSQL::Test::Cluster; +use Test::More; + +# Test that vacuum prunes away all dead tuples killed before OldestXmin + +# Set up nodes +my $node_primary = PostgreSQL::Test::Cluster->new('primary'); +$node_primary->init(allows_streaming => 'physical'); + +$node_primary->append_conf( + 'postgresql.conf', qq[ +hot_standby_feedback = on +log_recovery_conflict_waits = true +autovacuum = off +maintenance_work_mem = 1024 +]); +$node_primary->start; + +my $node_replica = PostgreSQL::Test::Cluster->new('standby'); + +$node_primary->backup('my_backup'); +$node_replica->init_from_backup($node_primary, 'my_backup', + has_streaming => 1); + +$node_replica->start; + +my $test_db = "test_db"; +$node_primary->safe_psql('postgres', "CREATE DATABASE $test_db"); + +# Save the original connection info for later use +my $orig_conninfo = $node_primary->connstr(); + +my $table1 = "vac_horizon_floor_table"; + +# Long-running Primary Session A +my $psql_primaryA = + 
$node_primary->background_psql($test_db, on_error_stop => 1); + +# Long-running Primary Session B +my $psql_primaryB = + $node_primary->background_psql($test_db, on_error_stop => 1); + +# Insert and update enough rows that we force at least one round of index +# vacuuming before getting to a dead tuple which was killed after the standby +# is disconnected. +# +# We need multiple index vacuuming passes to repro because after the standby +# reconnects to the primary, our backend's GlobalVisStates will not have been +# updated with the new horizon until something forces them to be updated. +# +# _bt_pendingfsm_finalize() calls GetOldestNonRemovableTransactionId() at the +# end of a round of index vacuuming, updating the backend's GlobalVisState +# and, in our case, moving maybe_needed backwards. +# +# Then vacuum's first pass will continue and pruning will find our later +# inserted and updated tuple HEAPTUPLE_RECENTLY_DEAD when compared to +# maybe_needed but HEAPTUPLE_DEAD when compared to OldestXmin. +$node_primary->safe_psql($test_db, qq[ + CREATE TABLE ${table1}(col1 int) with (autovacuum_enabled=false, fillfactor=10); + INSERT INTO $table1 VALUES(7); + INSERT INTO $table1 SELECT generate_series(1, 40) % 3; + CREATE INDEX on ${table1}(col1); + UPDATE $table1 SET col1 = 3 WHERE col1 = 0; + INSERT INTO $table1 VALUES(7); +]); + +# We will later move
Re: Allow logical failover slots to wait on synchronous replication
Hi, Thanks Bertrand for taking a look at the patch. On Mon, Jun 17, 2024 at 8:19 AM Bertrand Drouvot wrote: > > + int mode = SyncRepWaitMode; > > It's set to SyncRepWaitMode and then never change. Worth to get rid of "mode"? > I took a deeper look at this with GDB and I think it's necessary to cache the value of "mode". We first check: if (mode == SYNC_REP_NO_WAIT) return true; However after this check it's possible to receive a SIGHUP changing SyncRepWaitMode to SYNC_REP_NO_WAIT (e.g. synchronous_commit = 'on' -> 'off'), leading to lsn[-1]. > 2 > > + static XLogRecPtr lsn[NUM_SYNC_REP_WAIT_MODE] = > {InvalidXLogRecPtr}; > > I did some testing and saw that the lsn[] values were not always set to > InvalidXLogRecPtr right after. It looks like that, in that case, we should > avoid setting the lsn[] values at compile time. Then, what about? > > 1. remove the "static". > > or > > 2. keep the static but set the lsn[] values after its declaration. I'd prefer to keep the static because it reduces unnecessary contention on SyncRepLock if logical client has fallen behind. I'll add a change with your second suggestion. > 3 > > - /* > -* Return false if not all the standbys have caught up to the > specified > -* WAL location. > -*/ > - if (caught_up_slot_num != standby_slot_names_config->nslotnames) > - return false; > + if (!XLogRecPtrIsInvalid(lsn[mode]) && lsn[mode] >= > wait_for_lsn) > + return true; > > lsn[] values are/(should have been, see 2 above) just been initialized to > InvalidXLogRecPtr so that XLogRecPtrIsInvalid(lsn[mode]) will always return > true. I think this check is not needed then. Removed. > Out of curiosity, did you compare with standby_slot_names_from_syncrep set to > off > and standby_slot_names not empty? I didn't think 'standby_slot_names' would impact TPS as much since it's not grabbing the SyncRepLock but here's a quick test. Writer with 5 synchronous replicas, 10 pg_recvlogical clients and pgbench all running from the same server. Command: pgbench -c 4 -j 4 -T 600 -U "ec2-user" -d postgres -r -P 5 Result with: standby_slot_names = 'replica_1,replica_2,replica_3,replica_4,replica_5' latency average = 5.600 ms latency stddev = 2.854 ms initial connection time = 5.503 ms tps = 714.148263 (without initial connection time) Result with: standby_slot_names_from_syncrep = 'true', synchronous_standby_names = 'ANY 3 (A,B,C,D,E)' latency average = 5.740 ms latency stddev = 2.543 ms initial connection time = 4.093 ms tps = 696.776249 (without initial connection time) Result with nothing set: latency average = 5.090 ms latency stddev = 3.467 ms initial connection time = 4.989 ms tps = 785.665963 (without initial connection time) Again I think it's possible to improve the synchronous numbers if we cache but I'll try that out in a bit. -- John Hsu - Amazon Web Services
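To spell out the race being described, here is a sketch with stand-in globals (not the patch itself): SYNC_REP_NO_WAIT is -1, so if the function re-read SyncRepWaitMode after the early-return check, a reload that flips synchronous_commit could turn the later array access into lsn[-1]. Taking a local snapshot once means every later use sees one consistent mode.

    #include <stdbool.h>
    #include <stdint.h>

    typedef uint64_t XLogRecPtr;    /* stand-in for the real typedef */

    #define SYNC_REP_NO_WAIT        (-1)    /* as in syncrep.h */
    #define NUM_SYNC_REP_WAIT_MODE  3

    /* stand-ins for the real globals; a SIGHUP reload can rewrite the first
     * one between any two reads of it */
    static volatile int SyncRepWaitMode = SYNC_REP_NO_WAIT;
    static XLogRecPtr lsn[NUM_SYNC_REP_WAIT_MODE];

    static bool
    standby_confirmed_up_to(XLogRecPtr wait_for_lsn)
    {
        int         mode = SyncRepWaitMode;     /* snapshot once */

        if (mode == SYNC_REP_NO_WAIT)
            return true;

        /* all later uses go through the snapshot, so a concurrent change
         * cannot turn this into lsn[-1] */
        return lsn[mode] >= wait_for_lsn;
    }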
Re: Allow logical failover slots to wait on synchronous replication
Hi Amit, Thanks for taking a look. On Tue, Jun 18, 2024 at 10:34 PM Amit Kapila wrote: > > > The reading indicates when you set 'standby_slot_names_from_syncrep', > the TPS reduces as compared to when it is not set. It would be better > to see the data comparing 'standby_slot_names_from_syncrep' and the > existing parameter 'standby_slot_names'. I added new benchmark numbers in the reply to Bertrand, but I'll include in this thread for posterity. Writer with 5 synchronous replicas, 10 pg_recvlogical clients and pgbench all running from the same server. Command: pgbench -c 4 -j 4 -T 600 -U "ec2-user" -d postgres -r -P 5 Result with: standby_slot_names = 'replica_1,replica_2,replica_3,replica_4,replica_5' latency average = 5.600 ms latency stddev = 2.854 ms initial connection time = 5.503 ms tps = 714.148263 (without initial connection time) Result with: standby_slot_names_from_syncrep = 'true', synchronous_standby_names = 'ANY 3 (A,B,C,D,E)' latency average = 5.740 ms latency stddev = 2.543 ms initial connection time = 4.093 ms tps = 696.776249 (without initial connection time) Result with nothing set: latency average = 5.090 ms latency stddev = 3.467 ms initial connection time = 4.989 ms tps = 785.665963 (without initial connection time) > Can we make it a default > behavior that logical slots marked with a failover option will wait > for 'synchronous_standby_names' as per your patch's idea unless > 'standby_slot_names' is specified? I don't know if there is any value > in setting the 'failover' option for a slot without specifying > 'standby_slot_names', so was wondering if we can additionally tie it > to 'synchronous_standby_names'. Any better ideas? > No, I think that works pretty cleanly actually. Reserving some special keyword isn't great which is the only other thing I can think of. I've updated the patch and tests to reflect that. Attached the patch that addresses these changes. -- John Hsu - Amazon Web Services 0002-Wait-on-synchronous-replication-by-default-for-logic.patch Description: Binary data
Re: allow changing autovacuum_max_workers without restarting
Here is a rebased patch. One thing that still bugs me is that there is no feedback sent to the user when autovacuum_max_workers is set higher than autovacuum_worker_slots. I think we should at least emit a WARNING, perhaps from the autovacuum launcher, i.e., once when the launcher starts and then again as needed via HandleAutoVacLauncherInterrupts(). Or we could fail to start in PostmasterMain() and then ignore later misconfigurations via a GUC check hook. I'm not too thrilled about adding more GUC check hooks that depend on the value of other GUCs, but I do like the idea of failing instead of silently proceeding with a different value than the user configured. Any thoughts? -- nathan >From bd486d1ab302c4654b9cfbc57230bcf9b140711e Mon Sep 17 00:00:00 2001 From: Nathan Bossart Date: Sat, 22 Jun 2024 15:05:44 -0500 Subject: [PATCH v7 1/1] allow changing autovacuum_max_workers without restarting --- doc/src/sgml/config.sgml | 28 +++- doc/src/sgml/runtime.sgml | 12 ++--- src/backend/access/transam/xlog.c | 2 +- src/backend/postmaster/autovacuum.c | 44 --- src/backend/postmaster/postmaster.c | 2 +- src/backend/storage/lmgr/proc.c | 6 +-- src/backend/utils/init/postinit.c | 6 +-- src/backend/utils/misc/guc_tables.c | 11 - src/backend/utils/misc/postgresql.conf.sample | 3 +- src/include/postmaster/autovacuum.h | 1 + src/test/perl/PostgreSQL/Test/Cluster.pm | 1 + 11 files changed, 83 insertions(+), 33 deletions(-) diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml index f627a3e63c..da30d1ea4f 100644 --- a/doc/src/sgml/config.sgml +++ b/doc/src/sgml/config.sgml @@ -8544,6 +8544,25 @@ COPY postgres_log FROM '/full/path/to/logfile.csv' WITH csv; + + autovacuum_worker_slots (integer) + + autovacuum_worker_slots configuration parameter + + + + +Specifies the number of backend slots to reserve for autovacuum worker +processes. The default is 16. This parameter can only be set at server +start. + + +When changing this value, consider also adjusting +. + + + + autovacuum_max_workers (integer) @@ -8554,7 +8573,14 @@ COPY postgres_log FROM '/full/path/to/logfile.csv' WITH csv; Specifies the maximum number of autovacuum processes (other than the autovacuum launcher) that may be running at any one time. The default -is three. This parameter can only be set at server start. +is three. This parameter can only be set in the +postgresql.conf file or on the server command line. + + +Note that a setting for this value which is higher than + will have no effect, +since autovacuum workers are taken from the pool of slots established +by that setting. 
diff --git a/doc/src/sgml/runtime.sgml b/doc/src/sgml/runtime.sgml index 2f7c618886..4bb37faffe 100644 --- a/doc/src/sgml/runtime.sgml +++ b/doc/src/sgml/runtime.sgml @@ -781,13 +781,13 @@ psql: error: connection to server on socket "/tmp/.s.PGSQL.5432" failed: No such SEMMNI Maximum number of semaphore identifiers (i.e., sets) -at least ceil((max_connections + autovacuum_max_workers + max_wal_senders + max_worker_processes + 7) / 16) plus room for other applications +at least ceil((max_connections + autovacuum_worker_slots + max_wal_senders + max_worker_processes + 7) / 16) plus room for other applications SEMMNS Maximum number of semaphores system-wide -ceil((max_connections + autovacuum_max_workers + max_wal_senders + max_worker_processes + 7) / 16) * 17 plus room for other applications +ceil((max_connections + autovacuum_worker_slots + max_wal_senders + max_worker_processes + 7) / 16) * 17 plus room for other applications @@ -838,7 +838,7 @@ psql: error: connection to server on socket "/tmp/.s.PGSQL.5432" failed: No such When using System V semaphores, PostgreSQL uses one semaphore per allowed connection (), allowed autovacuum worker process -(), allowed WAL sender process +(), allowed WAL sender process (), and allowed background process (), in sets of 16. Each such set will @@ -847,13 +847,13 @@ psql: error: connection to server on socket "/tmp/.s.PGSQL.5432" failed: No such other applications. The maximum number of semaphores in the system is set by SEMMNS, which consequently must be at least as high as max_connections plus -autovacuum_max_workers plus max_wal_senders, +autovacuum_worker_slots plus max_wal_senders, plus max_worker_processes, plus one extra for each 16 allowed connections plus workers (see t
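Returning to the WARNING idea mentioned before the patch: a minimal sketch of the launcher-side check is below. This is illustration only; autovacuum_worker_slots is the GUC added by the patch, while the function name, message wording, and errhint are made up. The intended call sites would be launcher startup and HandleAutoVacLauncherInterrupts(), as suggested above.

/*
 * Sketch only: warn when autovacuum_max_workers exceeds the number of
 * reserved worker slots, so the silent capping is at least visible in the log.
 */
static void
warn_if_av_workers_capped(void)
{
    if (autovacuum_max_workers > autovacuum_worker_slots)
        ereport(WARNING,
                (errmsg("autovacuum_max_workers (%d) exceeds autovacuum_worker_slots (%d)",
                        autovacuum_max_workers, autovacuum_worker_slots),
                 errhint("At most %d autovacuum workers will run concurrently.",
                         autovacuum_worker_slots)));
}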
Re: Improve the connection failure error messages
Nisha Moond writes: > Attached v5 patch with the translator comments as suggested. I looked at this, and I agree with the goal, but I find just about all of the translator comments unnecessary. The ones that are useful are useful only because the message is violating one of our message style guidelines [1]: When citing the name of an object, state what kind of object it is. Rationale: Otherwise no one will know what “foo.bar.baz” refers to. So, for example, where you have + +/* + * translator: first %s is the subscription name, second %s is the + * error + */ + errmsg("subscription \"%s\" could not connect to the publisher: %s", stmt->subname, err))); I don't find that that translator comment is adding anything. But there are a couple of places like +/* + * translator: first %s is the slotsync worker name, second %s is the + * error + */ +errmsg("\"%s\" could not connect to the primary server: %s", app_name.data, err)); I think that the right cure for the ambiguity here is not to add a translator comment, but to label the name properly, perhaps like errmsg("synchronization worker \"%s\" could not connect to the primary server: %s", app_name.data, err)); regards, tom lane [1] https://www.postgresql.org/docs/devel/error-style-guide.html#ERROR-STYLE-GUIDE-OBJECT-TYPE
Re: Add a GUC check hook to ensure summarize_wal cannot be enabled when wal_level is minimal
On Thu, Jul 4, 2024 at 10:35 AM Fujii Masao wrote: > +1 for v18 or later. However, since the reported issue is in v17, > it needs to be addressed without such a improved check mechanism. Here is a draft patch for that. This is only lightly tested at this point, so further testing would be greatly appreciated, as would code review. NOTE THAT this seems to require bumping XLOG_PAGE_MAGIC and PG_CONTROL_VERSION, which I imagine won't be super-popular considering we've already shipped beta2. However, I don't see another principled way to fix the reported problem. We do currently log an XLOG_PARAMETER_CHANGE message at startup when wal_level has changed, but that's useless for this purpose. Consider that WAL summarization on a standby can start from any XLOG_CHECKPOINT_SHUTDOWN or XLOG_CHECKPOINT_REDO record, and the most recent XLOG_PARAMETER_CHANGE may be arbitrarily far behind that point, and replay may be arbitrarily far ahead of that point by which things may have changed again. The only way to know for certain what wal_level was in effect at the time one of those records was generated is if the record itself contains that information, so that's what I did. If I'm reading the code correctly, this doesn't increase the size of the WAL, because XLOG_CHECKPOINT_REDO was previously storing a dummy integer, and CheckPoint looks to have an alignment padding hole where I inserted the new member. Even if the size did increase, I don't think it would matter: these records aren't emitted frequently enough for it to be a problem. But I think it doesn't. However, it does change the format of WAL, and because the checkpoint data is also stored in the control file, it changes the format of that, too. If we really, really don't want to let those values change at this point in the release cycle, then we could alternatively just document the kinds of scenarios that this protects against as unsupported and admonish people sternly not to do that kind of thing. I think it's actually sort of rare, because you have to do something like: enable WAL summarization, do some stuff, restart with wal_level=minimal and summarize_wal=off, do some more stuff but not so much stuff that any of the WAL you generate gets removed, then restart again with wal_level=replica/logical and summarize_wal=on again, so that the WAL generated at the lower WAL level gets summarized before it gets removed. Then, an incremental backup that you took after doing that dance against a prior backup taken before doing all that might get confused about what to do. Although that does not sound like a terribly likely or advisable sequence of steps, I bet some people will do it and then it will be hard to figure out what went wrong (and even after we do, it won't make those people whole). And on the flip side I don't think that bumping XLOG_PAGE_MAGIC or PG_CONTROL_VERSION will cause huge inconvenience, because people don't tend to put beta2 into production -- the early adopters tend to go for .0 or .1, not betaX. -- Robert Haas EDB: http://www.enterprisedb.com v1-0001-Do-not-summarize-WAL-if-generated-with-wal_level-.patch Description: Binary data
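Put another way, once the generating wal_level travels in the checkpoint/redo records themselves, the summarizer's decision becomes a trivial comparison. The following is an illustrative sketch only (the function name is invented and the patch's actual structure may differ); WAL_LEVEL_REPLICA is the existing WalLevel enum value from access/xlog.h:

#include "access/xlog.h"        /* WalLevel: WAL_LEVEL_MINIMAL, _REPLICA, _LOGICAL */

/*
 * Sketch: WAL is only safe to summarize if it was generated at
 * wal_level >= replica, as recorded in the checkpoint/redo record.
 */
static bool
wal_summarizable(int record_wal_level)
{
    return record_wal_level >= WAL_LEVEL_REPLICA;
}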
Re: ssl tests fail due to TCP port conflict
On 2024-07-08 Mo 8:00 AM, Alexander Lakhin wrote: Hello, 07.06.2024 17:25, Tom Lane wrote: Andrew Dunstan writes: I still think my patch to force TCP mode for the SSL test makes sense as well. +1 to both things. If that doesn't get the failure rate down to an acceptable level, we can look at the retry idea. I have pushed patches for both of those (i.e. start SSL test nodes in TCP mode and change the range of ports we allocate server ports from). I didn't see this email until after I had pushed them. I'd like to add that the kerberos/001_auth test also suffers from the port conflict, but slightly differently. Look for example at [1]: krb5kdc.log contains: Jul 02 09:29:41 andres-postgres-buildfarm-v5 krb5kdc[471964](info): setting up network... Jul 02 09:29:41 andres-postgres-buildfarm-v5 krb5kdc[471964](Error): Address already in use - Cannot bind server socket on 127.0.0.1.55853 Jul 02 09:29:41 andres-postgres-buildfarm-v5 krb5kdc[471964](Error): Failed setting up a UDP socket (for 127.0.0.1.55853) Jul 02 09:29:41 andres-postgres-buildfarm-v5 krb5kdc[471964](Error): Address already in use - Error setting up network As far as I can see, the port for kdc is chosen by PostgreSQL::Test::Kerberos, via PostgreSQL::Test::Cluster::get_free_port(), which checks only for TCP port availability (with can_bind()), but not for UDP, so this increases the probability of the conflict for this test (a similar failure: [2]). Although we can also find a failure with TCP: [3] (It's not clear to me what processes can use UDP ports while testing, but maybe those buildfarm animals are running on the same logical machine simultaneously?) [1] https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=rorqual&dt=2024-07-02%2009%3A27%3A15 [2] https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=mylodon&dt=2024-05-15%2001%3A25%3A07 [3] https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=grassquit&dt=2024-07-04%2008%3A28%3A19 Let's see if this persists now that we are testing for free ports in a different range, not the range usually used for ephemeral ports. cheers andrew -- Andrew Dunstan EDB: https://www.enterprisedb.com
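The port probe itself lives in Perl (PostgreSQL::Test::Cluster::get_free_port() and its can_bind() helper), but the missing piece is simply attempting a UDP bind in addition to the TCP one before declaring a port free. In C terms the idea is roughly the following; this is an illustrative sketch, not project code:

#include <stdbool.h>
#include <string.h>
#include <unistd.h>
#include <arpa/inet.h>
#include <netinet/in.h>
#include <sys/socket.h>

/* Return true if a UDP socket can currently be bound to host:port. */
static bool
udp_port_is_free(const char *host, unsigned short port)
{
    struct sockaddr_in addr;
    int         sock = socket(AF_INET, SOCK_DGRAM, 0);
    bool        ok;

    if (sock < 0)
        return false;
    memset(&addr, 0, sizeof(addr));
    addr.sin_family = AF_INET;
    addr.sin_port = htons(port);
    if (inet_pton(AF_INET, host, &addr.sin_addr) != 1)
    {
        close(sock);
        return false;
    }
    ok = (bind(sock, (struct sockaddr *) &addr, sizeof(addr)) == 0);
    close(sock);
    return ok;
}

A candidate port would then only be accepted if both the TCP and the UDP bind succeed.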
Re: Should we work around msvc failing to compile tab-complete.c?
Hi, On 2024-07-08 14:18:03 -0400, Tom Lane wrote: > Andres Freund writes: > > Compiling postgres on windows with tab-completion support fails either with > > "fatal error C1026: parser stack overflow, program too complex”. > > or (in recent versions) with > > "…/src/bin/psql/tab-complete.c(4023): fatal error C1001: Internal compiler > > error." > > > It's pretty easy to work around the error [2]. I wonder if we should just do > > that, then we don't have to insist on a very new msvc version being used and > > it'll work even if they just decide to fix the internal compiler error. > > I'm on board with doing something here, but wouldn't we want to > back-patch at least a minimal fix to all supported branches? I think we'd need to backpatch more for older branches; at least commit 3f28bd7337d (Thomas Munro, 2022-12-22 17:14:23 +1300, "Add work-around for VA_ARGS_NARGS() on MSVC"). Given that - afaict - tab completion never used to work with msvc, I think it'd be ok to just do it in 17 or 16+17 or such. Obviously nobody is currently building with readline support for windows - not sure if any packager is going to go back and add support for it in older branches. > As for the specific thing to do, that long if-chain seems horrid > from an efficiency standpoint as well as stressing compiler > implementations. Indeed. Even with gcc it's one of the slowest files to compile in our codebase. At -O2, tab-complete.c takes 7.3s with gcc 14. Just adding an else if (HeadMatches("ALTER")) { } around all the ALTER branches reduces that to 4.5s for me. Doing that for COMMENT and CREATE gets it down to 3.2s. > I realize that this pretty much only needs to run > at human-reaction-time speed, but it still offends my inner nerd. Same. > Could we perhaps have a table of first words of each interesting > match, and do a lookup in that before dispatching to code segments > that are individually similar to what's there now? Eventually, yes, it ought to be table driven in some form. But I wonder if that's setting the bar too high for now. Even just doing some manual re-nesting seems like it could get us quite far? I'm thinking of something like an outer if-else-if chain for CREATE, ALTER, DROP, and COMMENT, with everything that doesn't fit within those (e.g. various TailMatches() rules, backslash commands, etc.) handled in a final branch; that would reduce the number of redundant checks a lot. The amount of whitespace changes that'd imply isn't great, but I don't really see a way around that. Greetings, Andres Freund
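For illustration, the manual re-nesting described above amounts to a fragment like the one below inside psql_completion(). The macros (HeadMatches, Matches, MatchAny, COMPLETE_WITH) are the existing ones from tab-complete.c; the particular completions shown are abbreviated stand-ins, not the real rule set:

/* Rough sketch: dispatch once on the first word, then test the detailed
 * patterns inside that branch instead of in one flat chain. */
if (HeadMatches("ALTER"))
{
    if (Matches("ALTER", "TABLE", MatchAny))
        COMPLETE_WITH("ADD", "ALTER", "DROP", "RENAME");
    /* ... all other ALTER rules, nested here ... */
}
else if (HeadMatches("CREATE"))
{
    /* ... CREATE rules ... */
}
else if (HeadMatches("DROP"))
{
    /* ... DROP rules ... */
}
else
{
    /* TailMatches() rules, backslash commands, etc. */
}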
Re: tests fail on windows with default git settings
On 2024-07-07 06:30:57 -0400, Andrew Dunstan wrote: > > On 2024-07-07 Su 1:26 AM, Tom Lane wrote: > > Andres Freund writes: > > > Do we want to support checking out with core.autocrlf? > > -1. It would be a constant source of breakage, and you could never > > expect that (for example) making a tarball from such a checkout > > would match anyone else's results. > Yeah, totally agree. > > > > > If we do not want to support that, ISTM we ought to raise an error > > > somewhere? > > +1, if we can figure out how. > > > > > > > > ISTM the right fix is probably to use PG_BINARY_R mode instead of "r" when > opening the files, at least in the case of the test_json_parser tests. That does seem like it'd fix this issue, assuming the parser can cope with \r\n. I'm actually mildly surprised that the tests don't fail when *not* using autocrlf, because afaict test_json_parser_incremental.c doesn't set stdout to binary and thus we presumably end up with \r\n in the output? Except that that can't be true, because the test does pass on repos without autocrlf... That approach does seem to mildly conflict with Tom's and your preference for fixing this by disallowing core.autocrlf? If we do so, the test never ought to see a crlf? Greetings, Andres Freund
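(As an aside on the stdout question above: if the test program ever did need to guarantee \n-only output on Windows, putting stdout into binary mode is a one-liner. The following standalone sketch is for illustration only and is not part of any posted patch.)

#include <stdio.h>

#ifdef WIN32
#include <fcntl.h>
#include <io.h>
#endif

int
main(void)
{
#ifdef WIN32
    /* Stop the C runtime from translating \n into \r\n on stdout. */
    _setmode(_fileno(stdout), _O_BINARY);
#endif

    /* ... the rest of the test program would go here ... */
    printf("no carriage returns added on Windows\n");
    return 0;
}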
Re: Interrupts vs signals
On Mon, Jul 8, 2024 at 5:38 AM Heikki Linnakangas wrote: > Another approach would be to move the responsibility of background > worker state notifications out of postmaster completely. When a new > background worker is launched, the worker process itself could send the > notification that it has started. And similarly, when a worker exits, it > could send the notification just before exiting. There's a little race > condition with exiting: if a process is waiting for the bgworker to > exit, and launches a new worker immediately when the old one exits, > there will be a brief period when the old and new process are alive at > the same time. The old worker wouldn't be doing anything interesting > anymore since it's exiting, but it still counts towards > max_worker_processes, so launching the new process might fail because of > hitting the limit. Maybe we should just bump up max_worker_processes. Or > postmaster could check PMChildFlags and not count processes that have > already deregistered from PMChildFlags towards the limit. I can testify that the current system is the result of a lot of trial and error. I'm not saying it can't be made better, but my initial attempts at getting this to work (back in the 9.4 era) resembled what you proposed here, were consequently a lot simpler than what we have now, and also did not work. Race conditions like you mention here were part of that. Another consideration is that fork() can fail, and in that case, the process that tried to register the new background worker needs to find out that the background worker won't ever be starting. Yet another problem is that, even if fork() succeeds, the new process might fail before it executes any of our code e.g. because it seg faults very early, a case that actually happened to me - inadvertently - while I was testing these facilities. I ended up deciding that we can't rely on the new process to do anything until it's given us some signal that it is alive and able to carry out its duties. If it dies before telling us that, or never starts in the first place, we have to have some other way of finding that out, and it's difficult to see how that can happen without postmaster involvement. -- Robert Haas EDB: http://www.enterprisedb.com
Re: Should we work around msvc failing to compile tab-complete.c?
Andres Freund writes: > Given that - afaict - tab completion never used to work with msvc, I think > it'd be ok to just do it in 17 or 16+17 or such. Obviously nobody is currently > building with readline support for windows - not sure if any packager is going > to go back and add support for it in older branches. If it doesn't even compile in existing branches, that seems to me to be plenty enough reason to call it a new HEAD-only feature. That gives us some leeway for experimentation and a more invasive patch than we might want to risk in stable branches. > On 2024-07-08 14:18:03 -0400, Tom Lane wrote: >> Could we perhaps have a table of first words of each interesting >> match, and do a lookup in that before dispatching to code segments >> that are individually similar to what's there now? > Eventually, yes, it ought to be table driven in some form. But I > wonder if that's setting the bar too high for now. Even just doing some manual > re-nesting seems like it could get us quite far? What I'm thinking is that (as you say) any fix that's not as ugly as Kirk's hack is going to involve massive reindentation, with corresponding breakage of any pending patches. It would be a shame to do that twice, so I'd rather look for a long-term fix instead of putting in a stop-gap. Let me play with the "table" idea a little bit and see what I get. regards, tom lane
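As a very rough illustration of the "table" idea: none of the handler functions below exist today; they stand in for code blocks individually similar to what tab-complete.c already contains, and dispatch_completion() and its first_word argument are likewise invented. lengthof() and pg_strcasecmp() are existing helpers.

typedef void (*CompletionHandler) (const char *text);

static const struct
{
    const char *first_word;
    CompletionHandler handler;
} completion_table[] = {
    {"ALTER", complete_alter},          /* hypothetical handlers, one per verb */
    {"COMMENT", complete_comment},
    {"CREATE", complete_create},
    {"DROP", complete_drop},
};

/*
 * Dispatch on the first word of the input; return false if nothing matched
 * and the caller should fall back to the existing flat if-chain.
 */
static bool
dispatch_completion(const char *first_word, const char *text)
{
    for (size_t i = 0; i < lengthof(completion_table); i++)
    {
        if (pg_strcasecmp(first_word, completion_table[i].first_word) == 0)
        {
            completion_table[i].handler(text);
            return true;
        }
    }
    return false;
}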
Re: 010_pg_basebackup.pl vs multiple filesystems
Hi, On 2024-07-07 09:10:48 -0400, Andrew Dunstan wrote: > On 2024-07-07 Su 7:28 AM, Andrew Dunstan wrote: > > I'll be happy to hear of one. I agree it's a mess. Maybe we could test > > that the temp directory is on the same device on Windows and skip the > > test if not? You could still get the test to run by setting TMPDIR > > and/or friends. > Maybe we should just not try to rename the directory. Looking at the test > I'm pretty sure the directory should be empty. Instead of trying to move it, > let's just remove it, and recreate it in the tmp location. Good catch, yes, that'd be much better! > diff --git a/src/bin/pg_basebackup/t/010_pg_basebackup.pl > b/src/bin/pg_basebackup/t/010_pg_basebackup.pl > index 489dde4adf..c0c334c6fc 100644 > --- a/src/bin/pg_basebackup/t/010_pg_basebackup.pl > +++ b/src/bin/pg_basebackup/t/010_pg_basebackup.pl > @@ -363,8 +363,8 @@ my $sys_tempdir = > PostgreSQL::Test::Utils::tempdir_short; > # Elsewhere use $tempdir to avoid file system boundary issues with moving. > my $tmploc = $windows_os ? $sys_tempdir : $tempdir; The comment would need a bit of editing, I guess. I think we should consider just getting rid of the OS-dependent switch now; it shouldn't be needed anymore? > -rename("$pgdata/pg_replslot", "$tmploc/pg_replslot") > - or BAIL_OUT "could not move $pgdata/pg_replslot"; > +rmtree("$pgdata/pg_replslot"); Should this perhaps be an rmdir, to ensure that we're not removing something we don't want (e.g. somebody adding an earlier test for slots that then gets broken by the rmtree)? Greetings, Andres Freund
Re: tests fail on windows with default git settings
Andres Freund writes: > On 2024-07-07 06:30:57 -0400, Andrew Dunstan wrote: >> ISTM the right fix is probably to use PG_BINARY_R mode instead of "r" when >> opening the files, at least in the case if the test_json_parser tests. > That approach does seem to mildly conflict with Tom and your preference for > fixing this by disallowing core.autocrlf? If we do so, the test never ought to > see a crlf? Is this code that will *never* be applied to user-supplied files? We certainly should tolerate \r\n in the general case (we even have written-down project policy about that!). While I wouldn't complain too hard about assuming that our own test files don't contain \r\n, if the code might get copied into a non-test scenario then it could create problems later. regards, tom lane
Re: 010_pg_basebackup.pl vs multiple filesystems
On Sun, Jul 7, 2024 at 3:02 AM Andres Freund wrote: > While working on [1] I encountered the issue that, on github-actions, > 010_pg_basebackup.pl fails on windows. > > The reason for that is that github actions uses two drives, with TMP/TEMP > located on c:, the tested code on d:. This causes the following code to fail: > > # Create a temporary directory in the system location. > my $sys_tempdir = PostgreSQL::Test::Utils::tempdir_short; Whatever we end up doing about this, it would be a good idea to check the other places that use tempdir_short and see if they also need adjustment. -- Robert Haas EDB: http://www.enterprisedb.com
Re: tests fail on windows with default git settings
On 2024-07-08 Mo 4:16 PM, Andres Freund wrote: On 2024-07-07 06:30:57 -0400, Andrew Dunstan wrote: On 2024-07-07 Su 1:26 AM, Tom Lane wrote: Andres Freund writes: Do we want to support checking out with core.autocrlf? -1. It would be a constant source of breakage, and you could never expect that (for example) making a tarball from such a checkout would match anyone else's results. Yeah, totally agree. If we do not want to support that, ISTM we ought to raise an error somewhere? +1, if we can figure out how. ISTM the right fix is probably to use PG_BINARY_R mode instead of "r" when opening the files, at least in the case of the test_json_parser tests. That does seem like it'd fix this issue, assuming the parser can cope with \r\n. Yes, the parser can handle \r\n. Note that they can only appear as whitespace in JSON; inside string values they can only be present via escapes. I'm actually mildly surprised that the tests don't fail when *not* using autocrlf, because afaict test_json_parser_incremental.c doesn't set stdout to binary and thus we presumably end up with \r\n in the output? Except that that can't be true, because the test does pass on repos without autocrlf... That approach does seem to mildly conflict with Tom's and your preference for fixing this by disallowing core.autocrlf? If we do so, the test never ought to see a crlf? IDK. I normally use core.autocrlf=false core.eol=lf on Windows. The editors I use are reasonably well behaved ;-) What I suggest (see attached) is we run the diff command with --strip-trailing-cr on Windows. Then we just won't care if the expected file and/or the output file has CRs. Not sure what the issue is with pg_bsd_indent, though. cheers andrew -- Andrew Dunstan EDB: https://www.enterprisedb.com diff --git a/src/test/modules/test_json_parser/t/003_test_semantic.pl b/src/test/modules/test_json_parser/t/003_test_semantic.pl index 74e0fa5bb1..2a430ad0c4 100644 --- a/src/test/modules/test_json_parser/t/003_test_semantic.pl +++ b/src/test/modules/test_json_parser/t/003_test_semantic.pl @@ -29,7 +29,9 @@ print $fh $stdout, "\n"; close($fh); -($stdout, $stderr) = run_command([ "diff", "-u", $fname, $test_out ]); +my @diffopts = ("-u"); +push(@diffopts, "--strip-trailing-cr") if $windows_os; +($stdout, $stderr) = run_command([ "diff", @diffopts, $fname, $test_out ]); is($stdout, "", "no output diff"); is($stderr, "", "no diff error"); diff --git a/src/test/modules/test_json_parser/test_json_parser_incremental.c b/src/test/modules/test_json_parser/test_json_parser_incremental.c index f4c442ac36..47040e1e42 100644 --- a/src/test/modules/test_json_parser/test_json_parser_incremental.c +++ b/src/test/modules/test_json_parser/test_json_parser_incremental.c @@ -124,7 +124,7 @@ main(int argc, char **argv) makeJsonLexContextIncremental(&lex, PG_UTF8, need_strings); initStringInfo(&json); - if ((json_file = fopen(testfile, "r")) == NULL) + if ((json_file = fopen(testfile, PG_BINARY_R)) == NULL) pg_fatal("error opening input: %m"); if (fstat(fileno(json_file), &statbuf) != 0) diff --git a/src/test/modules/test_json_parser/test_json_parser_perf.c b/src/test/modules/test_json_parser/test_json_parser_perf.c index ea85626cbd..74cc5f3f54 100644 --- a/src/test/modules/test_json_parser/test_json_parser_perf.c +++ b/src/test/modules/test_json_parser/test_json_parser_perf.c @@ -55,7 +55,7 @@ main(int argc, char **argv) sscanf(argv[1], "%d", &iter); - if ((json_file = fopen(argv[2], "r")) == NULL) + if ((json_file = fopen(argv[2], PG_BINARY_R)) == NULL) pg_fatal("Could not open input file '%s': %m", argv[2]); while ((n_read = fread(buff, 1, 6000, json_file)) > 0)
Re: [PATCH] pg_stat_activity: make slow/hanging authentication more visible
On Sun, Jun 30, 2024 at 10:48 AM Noah Misch wrote: > That looks like a reasonable user experience. Is any field newly-nullable? Technically I think the answer is no, since backends such as walwriter already have null database and user fields. It's new for a client backend to have nulls there, though. > That said, it > may be more fruitful to arrange for authentication timeout to cut through PAM > etc. That seems mostly out of our hands -- the misbehaving modules are free to ignore our signals (and do). Is there another way to force the issue? > Hanging connection slots hurt even if they lack an xmin. Oh, would releasing the xmin not really move the needle, then? > I assume it > takes an immediate shutdown to fix them? That's my understanding, yeah. > > Would anyone like me to be more aggressive, and create a pgstat entry > > as soon as we have the opening transaction? Or... as soon as a > > connection is made? > > All else being equal, I'd like backends to have one before taking any lmgr > lock or snapshot. I can look at this for the next patchset version. > > I haven't decided how to test these patches. Seems like a potential > > use case for injection points, but I think I'd need to preload an > > injection library rather than using the existing extension. Does that > > seem like an okay way to go? > > Yes. I misunderstood how injection points worked. No preload module needed, so v2 adds a waitpoint and a test along with a couple of needed tweaks to BackgroundPsql. I think 0001 should probably be applied independently. Thanks, --Jacob v2-0001-BackgroundPsql-handle-empty-query-results.patch Description: Binary data v2-0002-Test-Cluster-let-background_psql-work-asynchronou.patch Description: Binary data v2-0004-WIP-report-external-auth-calls-as-wait-events.patch Description: Binary data v2-0003-pgstat-report-in-earlier-with-STATE_AUTHENTICATIN.patch Description: Binary data
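To sketch the idea in 0004 concretely: around an external authentication call, the backend could report a wait event so pg_stat_activity shows it as blocked rather than apparently idle. pgstat_report_wait_start()/pgstat_report_wait_end() are the existing wait-event hooks; the event name WAIT_EVENT_AUTH_PAM below is invented for illustration and is not necessarily what the patch uses:

/* Sketch: surface time spent inside an external auth library (here PAM)
 * as a wait event.  WAIT_EVENT_AUTH_PAM is an illustrative placeholder. */
pgstat_report_wait_start(WAIT_EVENT_AUTH_PAM);
retval = pam_authenticate(pamh, 0);
pgstat_report_wait_end();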
Re: 010_pg_basebackup.pl vs multiple filesystems
On 2024-07-08 Mo 4:45 PM, Robert Haas wrote: On Sun, Jul 7, 2024 at 3:02 AM Andres Freund wrote: While working on [1] I encountered the issue that, on github-actions, 010_pg_basebackup.pl fails on windows. The reason for that is that github actions uses two drives, with TMP/TEMP located on c:, the tested code on d:. This causes the following code to fail: # Create a temporary directory in the system location. my $sys_tempdir = PostgreSQL::Test::Utils::tempdir_short; Whatever we end up doing about this, it would be a good idea to check the other places that use tempdir_short and see if they also need adjustment. I don't think it's a problem. There are lots of systems that have tempdir on a different device. That's why we previously saw lots of errors from this code, resulting in the present incomplete workaround. The fact that we haven't seen such errors from other tests means we should be ok. cheers andrew -- Andrew Dunstan EDB: https://www.enterprisedb.com
Re: use CREATE DATABASE STRATEGY = FILE_COPY in pg_upgrade
Committed. -- nathan
Re: 010_pg_basebackup.pl vs multiple filesystems
Hi, On 2024-07-08 16:45:42 -0400, Robert Haas wrote: > On Sun, Jul 7, 2024 at 3:02 AM Andres Freund wrote: > > While working on [1] I encountered the issue that, on github-actions, > > 010_pg_basebackup.pl fails on windows. > > > > The reason for that is that github actions uses two drives, with TMP/TEMP > > located on c:, the tested code on d:. This causes the following code to > > fail: > > > > # Create a temporary directory in the system location. > > my $sys_tempdir = PostgreSQL::Test::Utils::tempdir_short; > > Whatever we end up doing about this, it would be a good idea to check > the other places that use tempdir_short and see if they also need > adjustment. FWIW, this was the only test that failed due to TMP being on a separate partition. I couldn't quite enable all tests (PG_TEST_EXTRA including libpq_encryption fails due to some gssapi issue, kerberos fails due to paths being wrong / not set up for windows, load_balance because I haven't set up hostnames), but I don't think any of the issues around those tests are related. I did also grep for other uses of tmpdir_short, they all seem to be differently motivated. I also looked for uses of rename() in tests. While src/bin/pg_verifybackup/t/007_wal.pl does move stuff around, it afaict is always going to be on the same filesystem. Greetings, Andres Freund
Re: Document use of ldapurl with LDAP simple bind
On Fri, Jun 28, 2024 at 12:11 AM Peter Eisentraut wrote: > This appears to imply that specifying ldapurl is only applicable for > search+bind. Maybe that whole message should be simplified to something > like > > "configuration mixes arguments for simple bind and search+bind" > > (The old wording also ignores that the error might arise via "ldapsuffix".) I kept the imperative "cannot" and tried to match the terminology with our documentation at [1]: cannot mix options for simple bind and search+bind modes WDYT? --Jacob [1] https://www.postgresql.org/docs/17/auth-ldap.html v2-0001-docs-explain-how-to-use-ldapurl-with-simple-bind.patch Description: Binary data v2-0002-ldap-test-ldapurl-with-simple-bind.patch Description: Binary data v2-0003-hba-improve-error-when-mixing-LDAP-bind-modes.patch Description: Binary data
Re: tests fail on windows with default git settings
Hi, On 2024-07-08 16:56:10 -0400, Andrew Dunstan wrote: > On 2024-07-08 Mo 4:16 PM, Andres Freund wrote: > > I'm actually mildly surprised that the tests don't fail when *not* using > > autocrlf, because afaict test_json_parser_incremental.c doesn't set stdout > > to > > binary and thus we presumably end up with \r\n in the output? Except that > > that > > can't be true, because the test does pass on repos without autocrlf... > > > > > > That approach does seem to mildly conflict with Tom and your preference for > > fixing this by disallowing core.autocrlf? If we do so, the test never ought > > to > > see a crlf? > > > > IDK. I normally use core.autocrlf=false core.eol=lf on Windows. The editors > I use are reasonably well behaved ;-) :) > What I suggest (see attached) is we run the diff command with > --strip-trailing-cr on Windows. Then we just won't care if the expected file > and/or the output file has CRs. I was wondering about that too, but I wasn't sure we can rely on that flag being supported... > Not sure what the issue is with pg_bsd_indent, though. I think it's purely that we *read* with fopen("r") and write with fopen("wb"). Which means that any \r\n in the input will be converted to \n in the output. That's not a problem if the repo has been cloned without autocrlf, as there are no crlf in the expected files, but if autocrlf has been used, the expected files don't match. It doesn't look like it'd be trivial to make indent remember what was used in the input. So I think for now the best path is to just use .gitattributes to exclude the expected files from crlf conversion. If we don't want to do so repo wide, we can do so just for these files. Greetings, Andres Freund