On Fri, Oct 18, 2019 at 03:25:05AM -0700, Andres Freund wrote:
Hi,
On 2019-10-17 12:47:47 -0300, Alvaro Herrera wrote:
On 2019-Oct-10, Ildar Musin wrote:
> 1. Unlike FDW API, in pluggable storage API there are no routines like
> "begin modify table" and "end modify table" and there is no shared
> state between insert/update/delete calls.
Hmm. I think adding a begin/end to modifytable is a reasonable thing to
do (it'd be a no-op for heap and zheap I guess).
I'm fairly strongly against that. Adding two additional "virtual"
function calls for something that's rarely going to be used seems like
too much overhead to me.
That seems a bit strange to me. Sure, if there's an alternative way to
achieve the desired behavior (a clear way to finalize writes etc.), then
cool, let's do that. But forcing people to use inconvenient workarounds
seems like a bad thing to me; having a convenient and clear API is
quite valuable, IMHO.
Let's see if this actually has a measurable overhead first.
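To make the overhead question concrete, here is a minimal sketch, assuming per-statement begin/end hooks that an AM may leave NULL. TableAmSketch, begin_modify, end_modify and the other names are hypothetical and are not part of the real TableAmRoutine. The point is only that an AM like heap, which leaves both hooks NULL, pays two pointer checks per statement rather than two indirect calls per row.

```c
#include <stddef.h>

/* Hypothetical per-statement state; not a real PostgreSQL type. */
typedef struct ModifyStateSketch { int rows_buffered; } ModifyStateSketch;

typedef struct TableAmSketch
{
    /* Optional hooks: an AM that needs no per-statement state leaves them NULL. */
    void (*begin_modify) (ModifyStateSketch *state);
    void (*end_modify) (ModifyStateSketch *state);
    void (*tuple_insert) (ModifyStateSketch *state, int value);
} TableAmSketch;

static int begin_calls = 0, end_calls = 0;

static void columnar_begin(ModifyStateSketch *s) { s->rows_buffered = 0; begin_calls++; }
/* A real AM would flush its buffered rows here. */
static void columnar_end(ModifyStateSketch *s) { s->rows_buffered = 0; end_calls++; }
static void buffered_insert(ModifyStateSketch *s, int v) { (void) v; s->rows_buffered++; }
static void heap_insert_sketch(ModifyStateSketch *s, int v) { (void) s; (void) v; }

static const TableAmSketch heap_like = {NULL, NULL, heap_insert_sketch};
static const TableAmSketch columnar_like = {columnar_begin, columnar_end, buffered_insert};

/* Executor-side driver: returns how many rows were buffered before end_modify. */
static int run_insert_statement(const TableAmSketch *am, int nrows)
{
    ModifyStateSketch state = {0};

    if (am->begin_modify)           /* one NULL check per statement */
        am->begin_modify(&state);
    for (int i = 0; i < nrows; i++)
        am->tuple_insert(&state, i);
    int buffered = state.rows_buffered;
    if (am->end_modify)             /* one NULL check per statement */
        am->end_modify(&state);
    return buffered;
}
```

Whether even those two checks are measurable is exactly what a benchmark would have to show; the sketch does not measure anything.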
> 2. It looks like I cannot implement custom storage options. E.g. for
> compressed storage it makes sense to implement different compression
> methods (lz4, zstd etc.) and corresponding options (like compression
> level). But as I can see, storage options (like fillfactor etc.) are
> hardcoded and are not extensible. A possible solution is to use GUCs,
> which would work but is not very convenient.
Yeah, the reloptions module is undergoing some changes. I expect that
there will be a way to extend reloptions from an extension, at the end
of that set of patches.
Cool.
Yep.
> 3. A somewhat surprising limitation: in order to use bitmap scans, the
> maximum number of tuples per page must not exceed 291, due to the
> MAX_TUPLES_PER_PAGE macro in tidbitmap.c, which is calculated based on
> an 8 kB page size. With a 1 MB page this restriction feels really
> limiting.
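The 291 figure follows from MaxHeapTuplesPerPage, which tidbitmap.c uses as MAX_TUPLES_PER_PAGE: after a 24-byte page header, each tuple needs at least a 4-byte line pointer plus a MAXALIGN'ed 24-byte tuple header. The sketch below redoes that arithmetic for 8 kB and 1 MB pages and also computes how many bitmap words each page entry reserves (mirroring WORDS_PER_PAGE in tidbitmap.c). The constants assume a 64-bit platform, and the bitmap word width is left as a parameter because it depends on platform and release.

```c
/* Assumed on-disk sizes on a 64-bit platform. */
#define PAGE_HEADER_SIZE     24        /* SizeOfPageHeaderData */
#define MIN_TUPLE_FOOTPRINT  (24 + 4)  /* MAXALIGN'ed tuple header + line pointer */

/* Same shape as MaxHeapTuplesPerPage: the most tuples that can fit on a page. */
static int max_tuples_per_page(int blcksz)
{
    return (blcksz - PAGE_HEADER_SIZE) / MIN_TUPLE_FOOTPRINT;
}

/* Same shape as WORDS_PER_PAGE: words reserved per page entry in the bitmap,
 * whether or not the page actually holds that many matching tuples. */
static int words_per_page(int max_tuples, int bits_per_word)
{
    return (max_tuples - 1) / bits_per_word + 1;
}
```

With 64-bit words, each page entry grows from 5 words (40 bytes) at 8 kB to 586 words (4688 bytes) at 1 MB, regardless of how many TIDs on the page are set, which is the worst-case memory cost of simply raising the static limit.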
I suppose this is a hardcoded limit that needs to be fixed by patching
core as we make table AM more pervasive.
That's not unproblematic: a dynamic limit would make a number of
computations more expensive, and we already spend plenty of CPU cycles
building the TID bitmap. And we'd waste plenty of memory just having all
that space for the worst case. ISTM that we "just" need to replace the
TID bitmap with some tree-like structure.
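One way to read the "tree-like structure" suggestion is a set keyed by block number whose per-page storage grows only as far as the offsets actually seen. The following is a hypothetical sketch of that idea; PageNode, tid_add, tid_contains and bitmap_bytes are illustrative names, not tidbitmap.c functions. It uses a plain unbalanced binary search tree, which a real implementation would replace with something balanced or a radix tree, and it does not attempt the lossy-page fallback tidbitmap.c uses under memory pressure.

```c
#include <stdbool.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical sparse TID set: BST keyed by block number. */
typedef struct PageNode
{
    uint32_t    block;
    uint64_t   *words;      /* bit (off % 64) of words[off / 64] marks offset off */
    size_t      nwords;     /* grows only up to the highest offset set */
    struct PageNode *left, *right;
} PageNode;

static PageNode *find_or_add_page(PageNode **root, uint32_t block)
{
    PageNode  **link = root;

    while (*link != NULL)
    {
        if (block == (*link)->block)
            return *link;
        link = (block < (*link)->block) ? &(*link)->left : &(*link)->right;
    }
    *link = calloc(1, sizeof(PageNode));
    if (*link == NULL)
        abort();
    (*link)->block = block;
    return *link;
}

static void tid_add(PageNode **root, uint32_t block, uint32_t offset)
{
    PageNode   *page = find_or_add_page(root, block);
    size_t      need = offset / 64 + 1;

    if (need > page->nwords)
    {
        uint64_t   *grown = realloc(page->words, need * sizeof(uint64_t));

        if (grown == NULL)
            abort();
        /* zero only the newly added words */
        memset(grown + page->nwords, 0, (need - page->nwords) * sizeof(uint64_t));
        page->words = grown;
        page->nwords = need;
    }
    page->words[offset / 64] |= UINT64_C(1) << (offset % 64);
}

static bool tid_contains(const PageNode *node, uint32_t block, uint32_t offset)
{
    while (node != NULL && node->block != block)
        node = (block < node->block) ? node->left : node->right;
    if (node == NULL || offset / 64 >= node->nwords)
        return false;
    return (node->words[offset / 64] >> (offset % 64)) & 1;
}

/* Bytes spent on bitmap words, for comparison with a fixed worst-case entry. */
static size_t bitmap_bytes(const PageNode *node)
{
    if (node == NULL)
        return 0;
    return node->nwords * sizeof(uint64_t)
        + bitmap_bytes(node->left) + bitmap_bytes(node->right);
}

static void tid_free(PageNode *node)
{
    if (node == NULL)
        return;
    tid_free(node->left);
    tid_free(node->right);
    free(node->words);
    free(node);
}
```

A page with only low offsets costs one word, but a single high offset still forces a dense run of words up to it, so a production design would probably also want a sparse per-page representation.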
I think zedstore has roughly the same problem, and Heikki mentioned
some possible solutions for dealing with it in his pgconfeu talk (and it
was discussed in the zedstore thread, I think).
> 4. In order to use WAL-logging, each page must start with a standard
> 24-byte PageHeaderData, even if the storage itself doesn't need it. Not
> a big deal though. Another (actually documented) WAL-related limitation
> is that only generic WAL can be used within an extension. So unless
> inserts are made in bulk, it's going to require a lot of disk space to
> accommodate the WAL and a lot of bandwidth for replication.
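For reference, the 24 bytes break down as below. This is a stand-alone mirror of the field layout of PageHeaderData from bufpage.h, written with <stdint.h> types for illustration; it is not the real definition. As far as I understand, generic WAL relies on pd_lower and pd_upper to skip the unused hole in the middle of the page, and the buffer manager relies on pd_lsn to flush WAL before the page, which is why AM pages need this header even when the AM itself has no use for it.

```c
#include <stddef.h>
#include <stdint.h>

/* Mirror of PageXLogRecPtr: the LSN split into two 32-bit halves. */
typedef struct { uint32_t xlogid; uint32_t xrecoff; } PageLsnMirror;

/* Illustrative mirror of PageHeaderData's layout (not the real struct). */
typedef struct PageHeaderMirror
{
    PageLsnMirror pd_lsn;            /* LSN of last WAL record that changed the page */
    uint16_t    pd_checksum;
    uint16_t    pd_flags;
    uint16_t    pd_lower;            /* offset to start of free space */
    uint16_t    pd_upper;            /* offset to end of free space */
    uint16_t    pd_special;          /* offset to AM-specific special space */
    uint16_t    pd_pagesize_version;
    uint32_t    pd_prune_xid;
    uint32_t    pd_linp[];           /* line pointers, 4 bytes each like ItemIdData */
} PageHeaderMirror;
```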
Not sure what to suggest. Either you should ignore this problem, or
you should fix it.
I think if it becomes a problem you should ask for an rmgr ID to use for
your extension, which we would encode in core, and then allow the
extension to set the relevant rmgr callbacks for that rmgr ID at startup.
But you should obviously first develop the WAL logging etc., and make
sure it's beneficial over generic WAL logging for your case.
AFAIK compressed/columnar engines generally implement two types of
storage - write-optimized store (WOS) and read-optimized store (ROS),
where the WOS is mostly just an uncompressed append-only buffer, and ROS
is compressed etc. ISTM the WOS would benefit from more elaborate WAL
logging, but the ROS should be mostly fine with generic WAL logging.
But yeah, we should test and measure how beneficial that actually is.
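A toy model of the WOS/ROS split, to show where the write patterns differ. Everything here is hypothetical: WOS_CAPACITY is tiny so the example flushes quickly, and run-length encoding stands in for a real compressor such as lz4 or zstd. Each append_value() is a small write to the WOS (the path where custom WAL records might pay off), while each flush_wos() writes one immutable, compressed ROS segment in bulk (where generic WAL is more likely to be adequate).

```c
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

#define WOS_CAPACITY 8      /* hypothetical, kept tiny so the example flushes */
#define ROS_MAX_RUNS 64     /* hypothetical fixed capacity for the sketch */

/* One run-length-encoded run; RLE stands in for a real compressor. */
typedef struct { int32_t value; uint32_t count; } Run;

typedef struct
{
    int32_t wos[WOS_CAPACITY];  /* write-optimized store: uncompressed, append-only */
    size_t  wos_len;
    Run     ros[ROS_MAX_RUNS];  /* read-optimized store: compressed, never rewritten */
    size_t  ros_runs;
    size_t  wos_appends;        /* number of small writes */
    size_t  ros_flushes;        /* number of bulk segment writes */
} ColumnStore;

/* Compress the current WOS contents into a new ROS segment and empty the WOS. */
static void flush_wos(ColumnStore *cs)
{
    size_t seg_start = cs->ros_runs;   /* never merge into an earlier segment */

    if (cs->wos_len == 0)
        return;
    for (size_t i = 0; i < cs->wos_len; i++)
    {
        if (cs->ros_runs > seg_start && cs->ros[cs->ros_runs - 1].value == cs->wos[i])
            cs->ros[cs->ros_runs - 1].count++;
        else
        {
            if (cs->ros_runs == ROS_MAX_RUNS)
                abort();               /* sketch: the ROS does not grow */
            cs->ros[cs->ros_runs].value = cs->wos[i];
            cs->ros[cs->ros_runs].count = 1;
            cs->ros_runs++;
        }
    }
    cs->wos_len = 0;
    cs->ros_flushes++;
}

/* Append one value to the WOS, flushing to the ROS when the buffer fills. */
static void append_value(ColumnStore *cs, int32_t value)
{
    cs->wos[cs->wos_len++] = value;
    cs->wos_appends++;
    if (cs->wos_len == WOS_CAPACITY)
        flush_wos(cs);
}
```

Appending 20 values and then flushing the remainder yields 20 small WOS writes but only 3 ROS segment writes; that ratio is the kind of thing the measurements should look at.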
regards
--
Tomas Vondra http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services