> > Furthermore, these types of queries seem to fit what I would call (for
> > lack of a better word) "sliding" dataframes. Arrow's aim (as I understand
> > it) is to standardize the memory model of the static dataframe data
> > structure; can it also support a sliding version?
I don't think there are any explicit library features planned around sliding
windows, but the current abstractions allow for lazily combining columns into
a single logical structure and working on that. I imagine sliding window
abstractions could be built from that.

On Thu, Sep 10, 2020 at 2:04 AM Pedro Silva <pedro.cl...@gmail.com> wrote:

> Hi Micah,
>
> Thank you for your reply and the links; the threads were quite interesting.
> You are right, I opened the Flink issue regarding Arrow support to
> understand whether it was on their roadmap.
>
> My use-case is processing a stream of events (or rows, if you will) to
> compute ~100-150 sliding window aggregations over a subset of the received
> fields (say 10 out of a row with 80+ fields).
>
> Something of the sort:
> *average(session_time) group by ID over 1 hour.*
>
> For the query above only 3 fields are required: session_time, ID, and the
> timestamp (implicitly required to define time windows), meaning that we can
> discard a significant amount of information from the original event.
>
> Furthermore, these types of queries seem to fit what I would call (for
> lack of a better word) "sliding" dataframes. Arrow's aim (as I understand
> it) is to standardize the memory model of the static dataframe data
> structure; can it also support a sliding version?
>
> Usually these queries are defined by data scientists and domain experts
> who are comfortable using Python, not Java or C++, which are the
> languages streaming engines are built in.
>
> My goal is to understand whether existing streaming engines like Flink
> can converge on a common model that could in the future help develop
> efficient cross-language streaming engines.
>
> I hope I've been able to clarify some points.
>
> Thanks
>
> On Fri, Sep 4, 2020 at 20:17, Micah Kornfield <emkornfi...@gmail.com>
> wrote:
>
>> Hi Pedro,
>> I think the answer is: it likely depends.
>> The main trade-off in using Arrow in a streaming process is the high
>> metadata overhead if you have very few rows. There have been prior
>> discussions on the mailing list about row-based formats and streaming
>> that might be useful [1][2] in expanding on the trade-offs.
>>
>> For some additional color: Brian Hulette gave a talk [3] a while ago about
>> potentially using Arrow within Beam (I believe Flink has a high overlap
>> with the Beam API) and some of the challenges. It also looks like there
>> was a Flink JIRA (that you might be on?) about using Arrow directly in
>> Flink and some of the trade-offs [4].
>>
>> The questions you posed are a little bit vague; if there is more context,
>> it might help make the conversation more productive.
>>
>> -Micah
>>
>> [1]
>> https://lists.apache.org/thread.html/33a4e1a272e77d4959c851481aa25c6e4aa870db172e4c1bbf2e3a35%40%3Cdev.arrow.apache.org%3E
>> [2]
>> https://lists.apache.org/thread.html/27945533db782361143586fd77ca08e15e96e2f2a5250ff084b462d6%40%3Cdev.arrow.apache.org%3E
>> [3] https://www.youtube.com/watch?v=avy1ifTZlhE
>> [4] https://issues.apache.org/jira/browse/FLINK-10929
>>
>> On Fri, Sep 4, 2020 at 12:39 AM Pedro Silva <pedro.cl...@gmail.com>
>> wrote:
>>
>> > Hello,
>> >
>> > This may be a stupid question, but is Arrow used for, or designed with,
>> > stream-processing use-cases in mind, where data is non-stationary, i.e.
>> > Flink stream-processing jobs?
>> >
>> > In particular, is it possible, from a given event source (say Kafka), to
>> > efficiently generate incremental record batches for stream processing?
>> >
>> > Suppose there is a data source that continuously generates messages with
>> > 100+ fields. You want to compute grouped aggregations (sums, averages,
>> > count distinct, etc.) over a select few of those fields, say 5 fields at
>> > most used across all queries.
>> >
>> > Is this a valid use-case for Arrow?
>> > What if time is important and some windowing technique has to be
>> > applied?
>> >
>> > Thank you very much for your time!
>> > Have a good day.