I think I understand what you are saying now.  There are a few
different things going on that lead to a need for backpressure.

1. Write backpressure

We support asynchronous writes.  Object stores like AWS S3 perform best
with many parallel I/O operations.  On a two core system you might have
8 different parallel writes in flight.  So we do exactly what you
described: we create a queue with a defined capacity, and when that
queue fills up we stop reading.
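As a minimal sketch of that idea (this is illustrative Python, not
Arrow code; the names and the capacity of 8 are assumptions), a bounded
queue between the scan and the asynchronous write tasks gives you the
backpressure for free, because put() blocks once the queue is full:

```python
import queue
import threading

CAPACITY = 8      # assumed cap on writes in flight, e.g. 4 per core on 2 cores
SENTINEL = None   # marks end of input for the writer threads

def run_pipeline(batches, num_writers=4):
    """Feed batches through a bounded queue to parallel 'write' workers."""
    q = queue.Queue(maxsize=CAPACITY)  # put() blocks when full -> backpressure
    written = []
    lock = threading.Lock()

    def writer():
        while True:
            batch = q.get()
            if batch is SENTINEL:
                break
            with lock:
                written.append(batch)  # stand-in for an async filesystem write

    threads = [threading.Thread(target=writer) for _ in range(num_writers)]
    for t in threads:
        t.start()
    for b in batches:
        q.put(b)  # the "reader" pauses here once CAPACITY items are queued
    for _ in threads:
        q.put(SENTINEL)
    for t in threads:
        t.join()
    return written
```

When the writers fall behind, the producer simply blocks in put(),
which is the "stop reading" behavior described above.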

2. Pipeline prioritization

The method you are describing requires that we prioritize the order of
pipeline processing.  For example, in a join node, we would fully
process the shorter build side and then process the probe side in a
streaming fashion.  We don't currently do that (and whether we would
even want to is a separate discussion).  Instead we read from both
sides simultaneously.  If the build-side read takes too long, then we
need to pause reading from the probe side.
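To make the pause concrete, here is a small single-threaded simulation
(hypothetical names, not Arrow's implementation): both sides are read
simultaneously, probe rows are buffered up to a cap, and once the
buffer is full the probe input is considered paused until the build
side finishes:

```python
def simulate_join(build_side, probe_side, probe_buffer_cap=4):
    """Read both join inputs at once; pause the probe side when its
    buffer is full and the build side has not finished yet."""
    build_table = {}   # key -> payload from the build side
    probe_buffer = []  # probe keys read while the build side is in progress
    results = []
    probe_paused = 0   # counts steps where the probe input was paused
    build_it, probe_it = iter(build_side), iter(probe_side)
    build_done = False
    while not build_done:
        # Read one build row per step (simulating a slow build input).
        try:
            key, payload = next(build_it)
            build_table[key] = payload
        except StopIteration:
            build_done = True
        # Read from the probe side only while its buffer has room.
        if len(probe_buffer) < probe_buffer_cap:
            try:
                probe_buffer.append(next(probe_it))
            except StopIteration:
                pass
        else:
            probe_paused += 1
    # Build finished: drain buffered probe rows, then stream the rest.
    for key in probe_buffer:
        if key in build_table:
            results.append((key, build_table[key]))
    for key in probe_it:
        if key in build_table:
            results.append((key, build_table[key]))
    return results, probe_paused
```

The probe_paused counter is where a real engine would send a pause
signal upstream instead of just not pulling.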

3. Non-typical operators

The as-of join node is an interesting operator.  It can have many
different inputs, and it requires us to read from all inputs at roughly
the same rate.  However, if some inputs are denser than others
(e.g. more rows per unit of time) then we need to pause the other
inputs while we catch up on the long tail.
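One way to picture the scheduling (a sketch under assumed names, not
the actual as-of join code) is to always advance the input whose
current timestamp is furthest behind; denser inputs are then implicitly
paused while the lagging ones catch up:

```python
import heapq

def asof_schedule(inputs):
    """Given per-input sorted timestamp lists, return the order in which
    a rate-balanced reader would consume rows: always pull from the
    input whose next timestamp is smallest."""
    pos = [0] * len(inputs)
    order = []  # (input_index, timestamp) in consumption order
    heap = [(inp[0], i) for i, inp in enumerate(inputs) if inp]
    heapq.heapify(heap)
    while heap:
        ts, i = heapq.heappop(heap)
        order.append((i, ts))
        pos[i] += 1
        if pos[i] < len(inputs[i]):
            heapq.heappush(heap, (inputs[i][pos[i]], i))
    return order
```

With a dense input [0, 1, 2, 3, 4, 5] and a sparse input [0, 10], the
sparse input sits idle (paused) while the dense one is drained, and
overall consumption stays in timestamp order.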

On Fri, May 20, 2022 at 5:46 PM Supun Kamburugamuve <su...@apache.org> wrote:
>
> Hi Sasha,
>
> For case 2, I don't see why we need a back-pressure mechanism. Let's say
> there is an IO thread. All we need is a queue with a defined capacity that
> feeds data from the IO thread to the Read task.
>
> Supun..
>
> On Fri, May 20, 2022 at 8:25 PM Sasha Krassovsky <krassovskysa...@gmail.com>
> wrote:
>
> > Hi Supun,
> > Roughly what happens now is #2. However, in your example, it may be the
> > case that we are reading CSV data from disk faster than we can transcode
> > it into Parquet and write it. Note that we attempt to use the full disk
> > bandwidth and assign batches to cores once the reads are done, so ideally a
> > core is never blocked on IO. In other words, even if we have only 2 cores,
> > we may kick off 100 batch reads and process each one as its read completes
> > and a core becomes available. This is where backpressure is needed: to
> > prevent this huge number of reads from piling up and filling memory faster
> > than we can process it.
> >
> > Sasha Krassovsky
> >
> > > On 20 May 2022, at 20:09, Supun Kamburugamuve <su...@apache.org>
> > > wrote:
> > >
> >
>
>
> --
> Supun Kamburugamuve
