On 26-11-2020 14:45, Bharath Rupireddy wrote:
On Thu, Nov 26, 2020 at 5:34 PM Luc Vlaming <l...@swarm64.com> wrote:

On 26-11-2020 12:36, Bharath Rupireddy wrote:
Few things:

IIUC Andres mentioned similar kinds of APIs earlier in [1].

[1] -
https://www.postgresql.org/message-id/20200924024128.kyk3r5g7dnu3fxxx%40alap3.anarazel.de

I would like to add some more info to one of the APIs:

typedef struct MultiInsertStateData
{
    MemoryContext        micontext;   /* A temporary memory context for multi insert. */
    BulkInsertStateData *bistate;     /* Bulk insert state. */
    TupleTableSlot     **mislots;     /* Array of buffered slots. */
    uint32               nslots;      /* Total number of buffered slots. */
    int64                nbytes;      /* Flush buffers if the total tuple size >= nbytes. */
    int32                nused;       /* Number of currently buffered slots for a multi insert batch. */
    int64                nsize;       /* Total tuple size for a multi insert batch. */
} MultiInsertStateData;

/* Creates a temporary memory context, allocates the MultiInsertStateData and
 * BulkInsertStateData, and initializes the other members. */
    void        (*begin_multi_insert) (Relation rel, MultiInsertStateData **mistate,
                                       uint32 nslots, uint64 nbytes);

/* Buffers the input slot into the mistate slots, computes the size of the
 * tuple and adds it to the total buffered tuple size.  If that size crosses
 * mistate->nbytes, the buffered tuples are flushed into the table.  For heapam,
 * the existing heap_multi_insert() can be used.  Once the buffer is flushed,
 * the micontext can be reset and the buffered slots cleared.  If nbytes, i.e.
 * the total tuple size of the batch, is not given, the tuple size is not
 * calculated and tuples are buffered until all nslots are filled, then
 * flushed. */
    void        (*do_multi_insert) (Relation rel, MultiInsertStateData *mistate,
                                    struct TupleTableSlot *slot, CommandId cid,
                                    int options);

/* Flush the buffered tuples if any. For heapam, existing heap_multi_insert
 * can be used. Deletes the temporary memory context and deallocates mistate. */
    void        (*end_multi_insert) (Relation rel, MultiInsertStateData *mistate,
                                     CommandId cid, int options);
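
To make the intended call sequence concrete, here is a rough caller-side
sketch.  The table_multi_insert_begin/insert/end names are only placeholders
for thin wrappers around the callbacks above (analogous to
table_tuple_insert()), and get_next_input_slot(), mycid and ti_options are
just stand-ins for the caller's input loop, command id and insert options:

    MultiInsertStateData *mistate;
    TupleTableSlot       *slot;

    /* Create the buffer: up to 1000 slots or ~64MB of tuple data per batch. */
    table_multi_insert_begin(rel, &mistate, 1000, 64 * 1024 * 1024);

    /* get_next_input_slot() stands in for e.g. COPY's parsing loop. */
    while ((slot = get_next_input_slot()) != NULL)
    {
        /* Buffers the slot; flushes internally once nslots or nbytes is hit. */
        table_multi_insert(rel, mistate, slot, mycid, ti_options);
    }

    /* Flush whatever is still buffered and free mistate. */
    table_multi_insert_end(rel, mistate, mycid, ti_options);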

Looks all good to me, except for the nbytes part.
Could you explain to me what use case that supports? IMHO the tableam can
best decide itself that it's time to flush, based on its implementation,
e.g. considering how many pages to flush at a time. This also means that
most of the fields of MultiInsertStateData could be private, since each
tableam would return a derivative of that struct (as with the dest
receivers).


nbytes is basically to support the following case: say the number of
tuples to buffer is 1000, and all of the tuples are toasted, each a few
hundred MB or even GB in size. Do we really want to wait until 1000
tuples are buffered, in which case a single query holds 1000 * (toasted
tuple size) in memory, i.e. potentially gigabytes? Having a memory limit
gives us flexibility here, and whether to use it or not is up to the
table AM implementation. The existing copy code also uses this mechanism
to decide when to flush, since it knows the tuple size after parsing the
input data.

If nbytes is not used by a table AM, then the multi insert can simply
wait until all nslots tuples are buffered, however much memory they
occupy.

IMO, we can retain nbytes for now to decide on when to flush. Thoughts?
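
To make the interplay of the two limits concrete, the flush check inside
do_multi_insert() could look roughly like this for heapam. This is just a
sketch; GetTupleSize() is an assumed helper returning the size of the
slot's materialized tuple:

    /* Buffer a copy of the incoming slot and account for its size. */
    ExecCopySlot(mistate->mislots[mistate->nused], slot);
    mistate->nsize += GetTupleSize(mistate->mislots[mistate->nused]); /* assumed helper */
    mistate->nused++;

    /* Flush when the slot array is full, or when nbytes (if given) is exceeded. */
    if (mistate->nused == mistate->nslots ||
        (mistate->nbytes > 0 && mistate->nsize >= mistate->nbytes))
    {
        heap_multi_insert(rel, mistate->mislots, mistate->nused,
                          cid, options, mistate->bistate);

        /* As described above: reset the context and reuse the buffer. */
        MemoryContextReset(mistate->micontext);
        mistate->nused = 0;
        mistate->nsize = 0;
    }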

I'm very sorry, I had not realized at all that the toasted data would be kept in memory until written out. I guess I'm not familiar enough with that part yet. I assumed this would be toasted beforehand in a tableam-agnostic way, and that any decision from the tableam to flush would happen well before a lot of memory had accumulated, which is a bit naive in hindsight.


I wonder how the do_multi_insert() API can decide when to flush based on
the number of pages. Do we need to pass the maximum number of pages the
buffered tuples may occupy, and track how many pages the currently
buffered tuples occupy, to decide when to flush? Or is that something the
existing table AM infrastructure already supports? And if we use the
number of pages to decide when to flush, how well does that work with
parallel inserts?


I was assuming each tableam would use its own logic, based on its needs and the tradeoffs a storage engine might want to provide. That does not mean it should ignore outside parameters like the aforementioned memory usage. It does imply that each tableam implements its own tracking of how much has accumulated and how and when to flush, because different AMs might track different statistics. IMHO, given that each tableam would anyway implement its own logic for storing a slot into a page, the extra logic for tracking these statistics seemed minor to me. Maybe I missed some parts that should be extracted into a generic interface, however?

Some examples of why a tableam could decide on its own on when to flush:
- the current heap implementation could accumulate a few pages (say up to 64) and thereby limit the amount of calls to write() and limit the accompanying blocks/context switches. This would also then make the writes more sequential wrt the processes which can help with the flushing I presume, like how the sequential scan was optimized to process a consequtive set of blocks per worker (see table_block_parallelscan_nextpage). - something like zheap could accumulate data based on the amount of columns so that a page with column data is completely filled, thereby limiting the write amplification. - something that would implement an lsm storage might accumulate a full in-memory level before flushing it out.


One thing I'm wondering is in which memory context the slots end up
being allocated. I'd assume we would want to keep the slots around
between flushes. If they are in the temporary context this might prove
problematic however?


I should not have used the word temporary; it actually is not temporary.
This memory context will be created in begin_multi_insert(), all the
buffered tuples are copied using this context, and it will be reset at
the end of each flush and reused. It gets destroyed at the end in
end_multi_insert(). I think we should do the same in the new APIs'
implementation.
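
A minimal sketch of that lifecycle, just so we are talking about the same
thing (only the context handling is shown, everything else elided):

    /* begin_multi_insert(): create the context once */
    mistate->micontext = AllocSetContextCreate(CurrentMemoryContext,
                                               "multi insert context",
                                               ALLOCSET_DEFAULT_SIZES);

    /* do_multi_insert(): copy/buffer the incoming tuple while it is current */
    oldcontext = MemoryContextSwitchTo(mistate->micontext);
    /* ... copy the tuple into the buffer ... */
    MemoryContextSwitchTo(oldcontext);

    /* after each flush: reset and reuse */
    MemoryContextReset(mistate->micontext);

    /* end_multi_insert(): destroy it */
    MemoryContextDelete(mistate->micontext);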

With Regards,
Bharath Rupireddy.
EnterpriseDB: http://www.enterprisedb.com


Okay. But in which context are the slots themselves allocated then? Because if we allocate the slots themselves in that context and reset it with each flush, we would also have to re-allocate the slots on every flush, which seems wasteful to me?

Regards,
Luc
Swarm64
