Greetings,

  I was poking around the Append node and considering my earlier
  suggestion to build an Async method of pulling data out of nodes under
  Append which support that option.  It looks workable, but I was also
  considering simply changing postgres_fdw to use async queries which
  are started at ExecInit time instead.

  A few more specifics about what I've been thinking:

  #1 - Add an Async mechanism to the executor

    - ExecAsyncCapable - Check if the node supports async.
    - ExecAsyncStart - Begin an async request from the node
    - ExecAsyncGetFD - Return FD to be used with select()
    - ExecAsyncConsume - Consume data from FD
    - ExecAsyncBusy - Check if getResult would block
    - ExecAsyncScan - Get next tuple
    - All of the above added to the FDW API also..

    This is clearly modeled on the libpq async system and can likely
    be simplified, but it feels like we'd want at least GetFD, Consume,
    and Busy, so that we can build a select() loop inside Append which
    waits until data is available somewhere, works through all of the
    nodes (dealing with ones which are done and don't return anything),
    and then returns a row back up.  As an aside- I've always wondered
    if we should have an explicit bulk mechanism instead of
    one-row-at-a-time, and perhaps we build that into this API.  I was
    also thinking the API would provide an identifier of some kind
    indicating which I/O channels are shared; Append would then be set
    up with a two-level list of paths to walk, issuing one request
    against each distinct I/O channel and simply moving on to the next
    entry for the same I/O channel when the prior one completes.
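
    To make the shape of that a bit more concrete, here's a rough
    sketch of what the executor-side entry points might look like.  All
    of the names and signatures below are hypothetical- just a way to
    visualize the API, loosely following existing executor conventions:

      /* Hypothetical async executor API- names/signatures illustrative
       * only; PlanState/TupleTableSlot come from nodes/execnodes.h. */

      /* Does this node (or its FDW) support async execution? */
      extern bool ExecAsyncCapable(PlanState *node);

      /* Kick off the async request (e.g. send the remote query). */
      extern void ExecAsyncStart(PlanState *node);

      /* File descriptor Append can hand to select() while waiting. */
      extern int  ExecAsyncGetFD(PlanState *node);

      /* Pull whatever data has arrived on that FD into the node. */
      extern void ExecAsyncConsume(PlanState *node);

      /* Would asking for the next tuple block right now? */
      extern bool ExecAsyncBusy(PlanState *node);

      /* Return the next tuple, or an empty slot when the node is done. */
      extern TupleTableSlot *ExecAsyncScan(PlanState *node);

    Append's loop would then collect the FDs of the not-yet-finished
    children via ExecAsyncGetFD, select() on them, call ExecAsyncConsume
    for whichever are readable, and hand back the first tuple that
    ExecAsyncScan can produce without blocking.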

  #2 - Make postgres_fdw build/send an async query during ExecInitNode.

    Currently, we wait to set up the remote cursor and fetch records
    until we've actually been asked for a record- but then we go and try
    to get 100 of them.  It seems like we could simply call
    create_cursor() at the bottom of postgresBeginForeignScan and follow
    that up with a PQsendQuery; postgresIterateForeignScan wouldn't
    really change, except that it would no longer need to create the
    cursor.  Sure, we'd end up blocking if there isn't data available
    for this node yet, but at least we'd get the query started during
    Init across all the remote servers, which would certainly be a
    massive improvement, and we wouldn't need to modify Append or the
    FDW API at all.  Practically speaking, we'd parallelize the initial
    request of 100 tuples and then scan through the results in order,
    meaning we'd only pull from one machine at a time if the result set
    is larger than that initial 100 tuples per system.  That said, if
    most of the work on the remote systems is getting the query started,
    it would still be a win.
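
    For illustration, that change might look something like the sketch
    below.  create_cursor() and pgfdw_report_error() are the existing
    helpers in contrib/postgres_fdw, but the scan-state field names and
    the exact shape of the early FETCH are just my assumptions here,
    not working code:

      /*
       * Sketch only: kick off the remote scan during ExecInit instead
       * of waiting for the first IterateForeignScan call.
       */
      static void
      postgresBeginForeignScan(ForeignScanState *node, int eflags)
      {
          PgFdwScanState *fsstate;
          char            sql[64];

          /* ... existing setup: connection lookup, deparsed query, etc ... */

          fsstate = (PgFdwScanState *) node->fdw_state;

          /* Declare the remote cursor now rather than lazily. */
          create_cursor(node);

          /* Fire off the first batch asynchronously; don't wait for rows. */
          snprintf(sql, sizeof(sql), "FETCH 100 FROM c%u",
                   fsstate->cursor_number);
          if (!PQsendQuery(fsstate->conn, sql))
              pgfdw_report_error(ERROR, NULL, fsstate->conn, false, sql);
      }

    postgresIterateForeignScan would then just collect that first batch
    with PQgetResult() on its first call, rather than issuing the FETCH
    itself and waiting synchronously for the rows.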

    This goes against the current documentation, which explicitly states
    that the actual scan should not be started until IterateForeignScan,
    but it's not clear, to me at least, why that's strictly necessary.

  One other thought going in favor of #1 is that we could make other
  nodes, such as SeqScan, support the new API, which would allow us to
  parallelize across, say, multiple tablespaces (on the presumption that
  they are independent I/O systems underneath, which may or may not be
  true, of course; perhaps we could explicitly ask the user for the I/O
  channel relationship during tablespace creation).
  
  With either suggestion, we would need to ensure that postgres_fdw
  works through the entire Async query and stores the results before
  trying to do another Async query, if we're going to keep the
  one-connection model, as we can't have two Async queries running at
  the same time.  I don't see that as a terrible issue though- we're
  already fetching/caching up to 100 rows of data for the foreign
  table anyway, and we'd still have the actual cursor.  If we get a
  request for a different cursor while we have an Async still open, we'd
  just finish out the current Async request of 100-or-fewer tuples,
  cache them, and then send a new request for the new relation.
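
  With libpq that means draining the in-flight result before the shared
  connection can take another query.  Something along these lines, as a
  sketch- the per-connection "pending" scan-state pointer and the
  store_returned_tuples() helper are hypothetical, not anything that
  exists today:

    /*
     * Sketch: before sending a new FETCH on a connection which already
     * has an async FETCH in flight for some other foreign scan, finish
     * that one out and cache its rows for later.
     */
    static void
    drain_pending_fetch(PGconn *conn, PgFdwScanState *pending)
    {
        PGresult *res;

        /* libpq allows only one outstanding query per connection. */
        while ((res = PQgetResult(conn)) != NULL)
        {
            if (PQresultStatus(res) == PGRES_TUPLES_OK)
                store_returned_tuples(pending, res);    /* hypothetical */
            PQclear(res);
        }
    }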

  This is all speculation at this point as I've not gotten down into
  the details of trying to implement any of it, so please feel free to
  point out any big holes. :)

  Thoughts?

    Thanks,

        Stephen
