On Thu, May 01, 2025 at 12:15:30PM -0700, Masahiko Sawada wrote: > In light of these concerns, I've been contemplating alternative > interface designs. One promising approach would involve registering > custom copy formats via a C function during module loading > (specifically, in _PG_init()). This method would require extension > authors to invoke a registration function, say > RegisterCustomCopyFormat(), in _PG_init() as follows: > > JsonLinesFormatId = RegisterCustomCopyFormat("jsonlines", > &JsonLinesCopyToRoutine, > &JsonLinesCopyFromRoutine); > > The registration function would validate the format name and store it > in TopMemoryContext. It would then return a unique identifier that can > be used subsequently to reference the custom copy format extension.
Hmm. How much should we care about the observability of the COPY format used by a given backend? Storing this information in a backend's TopMemoryContext is OK to get the extensibility basics to work, but could it make sense to use some shmem state to allocate a uint32 ID that could be shared by all backends. Contrary to EXPLAIN, COPY commands usually run for a very long time, so I am wondering if these APIs should be designed so as it would be possible to monitor the format used. One layer where the format information could be made available is the progress reporting view for COPY, for example. I can also imagine a pgstats kind where we do COPY stats aggregates, with a per-format pgstats kind, and sharing a fixed ID across multiple backends is relevant (when flushing the stats at shutdown, we would use a name/ID mapping like replication slots). I don't think that this needs to be relevant for the option part, just for the format where, I suspect, we should store in a shmem array based on the ID allocated the name of the format, the library of the callback and the function name fed to load_external_function(). Note that custom LWLock and wait events use a shmem state for monitoring purposes, where we are able to do ID->format name lookups as much as format->ID lookups. Perhaps it's OK not to do that for COPY, but I am wondering if we'd better design things from scratch with states in shmem state knowing that COPY is a long-running operation, and that if one mixes multiple formats they would most likely want to know which formats are bottlenecks, through SQL. Cloud providers would love that. > This approach offers several advantages: it would eliminate the > search_path issue, provide greater flexibility, and potentially > simplify the overall interface for users and developers alike. We > might be able to provide a view showing the registered custom COPY > format in the future. Also, these interfaces align with other > customizable functionalities such as custom rmgr, custom lwlock, > custom waitevent, and custom EXPLAIN option etc. Yeah, agreed with the search_path concerns. We are getting better at making areas of Postgres more pluggable lately, having a loading path where we don't have any of these potential issues by design matters. -- Michael
signature.asc
Description: PGP signature