First, note that there are different computation engines in different languages; the Rust implementation has datafusion[1], for example. For the rest of this email I will speak in more detail about the C++ computation engine that is in place today (which I am more familiar with). The C++ engine is documented here[2], although that documentation is a little sparse and we are working on an updated version[3].
Also note that the docs describe a "streaming execution engine" because it operates on the data in a batch-oriented fashion, streaming batches through the plan rather than materializing the whole table at once. However, this does not guarantee that it will use a small amount of memory. For example, if you were to ask the engine to sort the data, then the engine may need to cache the entire dataset in memory in order to fulfill that query, because the very last row it reads might be the very first row it needs to emit. (In the future this may mean spilling into temporary tables as memory runs out.)

However, for properly constructed queries, the engine should be able to operate as you are describing. The queries you describe sound to me like what one might expect to find in a "time series database", which is another term I've heard thrown around. I am not an expert in time series databases, so I don't know the extent of the computation required. However, the example you give (a 7 day rolling mean of daily US stock prices) is not something that could be computed efficiently today; it could be computed efficiently once "window functions" are supported. Window functions[4] are a query engine feature that enables the sliding window needed for a rolling average. I believe there are people at Voltron Data hoping to add support for window functions to the C++ streaming execution engine, but that is future work that is not currently in progress.

That being said, a time series execution engine would probably also need to know about indices, statistics, whether the data on disk is sorted or not (and by what columns), downsampling functions, interpolation functions, etc. In addition, beyond execution / computation there are concerns such as retention policies, streaming / appending data to disk, etc.

> So I am wondering if there is a way to design an engine that can satisfy
> both streaming and batch mode of processing. Or maybe it needs to be
> seperate engines but we can minimize the amount of duplication?
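As an aside on the rolling-mean example: the bounded-memory property you want is exactly what a streaming window computation gives you. Here is a toy sketch in plain Python (not the Arrow API; `rolling_mean` is just an illustrative name) showing that a 7 day rolling mean only ever needs to hold the last 7 values, no matter how long the input stream is:

```python
from collections import deque

def rolling_mean(prices, window=7):
    """Streaming rolling mean: never holds more than `window` values."""
    buf = deque(maxlen=window)  # bounded buffer; old values fall off the front
    total = 0.0
    for price in prices:
        if len(buf) == window:
            total -= buf[0]  # subtract the value about to be evicted
        buf.append(price)
        total += price
        yield total / len(buf)

# e.g. list(rolling_mean([1, 2, 3, 4, 5], window=3))
# yields [1.0, 1.5, 2.0, 3.0, 4.0]
```

The point is that memory usage is O(window), independent of the length of the stream; a batch engine without window functions would instead have to see the whole dataset to answer the same query.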
Regardless of the timeline and plans for window functions, the answer to this specific question is probably "yes", but I'm not enough of an expert in time series processing to answer with certainty. The streaming execution engine in Arrow today is quite generic. A graph of "exec nodes" is constructed, and data is passed through these exec nodes, starting from one or more sources and ending at a sink. The sources could be live data, which would satisfy your request for (3). The plan is currently run very similarly to an actor model, where batches are pushed from one node to another. I'm hoping to add more support for scheduling and backpressure at some point. Given what I know of the types of queries you are describing, I think this model should suffice to run them efficiently.

So, summarizing, I think some of the work we are doing will be useful to you (though possibly not sufficient) and it would be a good idea to reuse & share where possible.

[1] https://docs.rs/datafusion/latest/datafusion/
[2] https://arrow.apache.org/docs/cpp/streaming_execution.html
[3] https://github.com/apache/arrow/pull/12033
[4] https://medium.com/an-idea/rolling-sum-and-average-window-functions-mysql-7509d1d576e6

On Tue, Jan 11, 2022 at 11:19 AM Li Jin <ice.xell...@gmail.com> wrote:
>
> Hi,
>
> This is a somewhat lengthy email about thoughts around a streaming
> computation engine for Arrow dataset that I would like to hear feedback
> from Arrow devs.
>
> The main use cases that we are thinking for the streaming engine are time
> series data, i.e., data arrives in time order (e.g. daily US stock prices)
> and the query often follows the time order of the data (e.g., compute 7 day
> rolling mean of daily US stock prices).
>
> The main motivations for a streaming engine is (1) performance: always
> keeps small amount of hot data always in memory and cache (2)
> memory efficiency: the engine only need to keep small amounts of data in
> memory, e.g., for the 7 day rolling mean case, the engine never need to
> keep more than 7 day worth of stock price data, even it is computing this
> for a stream of 20 year data. (3) Live data application: data arrives in
> real time
>
> I have talked to Phillip Cloud and am aware of an effort going on to build
> a computation engine for SQL-like queries (mostly query on the entire
> dataset) but am unfamiliar with the details. So I am wondering if there is
> a way to design an engine that can satisfy both streaming and batch mode of
> processing. Or maybe it needs to be seperate engines but we can minimize
> the amount of duplication?
>
> Looking forward to any thoughts around this.
>
> Li
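P.S. To make the "graph of exec nodes pushing batches" model concrete, here is a rough sketch in plain Python pseudocode. The class and method names (`ExecNode`, `input_received`, etc.) are hypothetical and not the actual Arrow C++ API; this just illustrates the push-based, actor-like flow from source to sink:

```python
class ExecNode:
    """A node in the plan graph; batches are pushed downstream."""
    def __init__(self, downstream=None):
        self.downstream = downstream

    def input_received(self, batch):
        raise NotImplementedError

    def push(self, batch):
        # hand the (possibly transformed) batch to the next node
        if self.downstream is not None:
            self.downstream.input_received(batch)

class FilterNode(ExecNode):
    """Keeps only the rows matching a predicate."""
    def __init__(self, predicate, downstream=None):
        super().__init__(downstream)
        self.predicate = predicate

    def input_received(self, batch):
        self.push([row for row in batch if self.predicate(row)])

class SinkNode(ExecNode):
    """Terminal node that collects results."""
    def __init__(self):
        super().__init__()
        self.results = []

    def input_received(self, batch):
        self.results.extend(batch)

# Wire up a tiny plan: source -> filter -> sink.
sink = SinkNode()
plan = FilterNode(lambda x: x % 2 == 0, downstream=sink)
for batch in ([1, 2, 3], [4, 5, 6]):  # the source pushes batches as they arrive
    plan.input_received(batch)
# sink.results is now [2, 4, 6]
```

A source node fed by live data would push batches into the same graph as they arrive, which is why I think this model can cover your case (3); scheduling and backpressure are what keep a fast source from overwhelming slow downstream nodes.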