+1 for introducing Arrow in streaming processing, as we have made some attempts at this.

IMO, the metadata overhead is not likely to be a problem. If the streaming data arrives at a high rate, we can compensate with a large batch size without impacting the response time, while if the arrival rate is low, the metadata overhead is not likely to be a problem either, as the system load is low as well.
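A minimal sketch of this batching (assuming pyarrow; the schema and the `poll_messages` source below are hypothetical stand-ins for real message fields and a Kafka consumer loop):

    import pyarrow as pa

    SCHEMA = pa.schema([("key", pa.string()), ("value", pa.int64())])
    BATCH_SIZE = 4096  # large batches amortize the fixed per-batch metadata

    def record_batches(poll_messages):
        """Group incoming messages into Arrow RecordBatches of BATCH_SIZE rows.

        poll_messages is assumed to yield dicts with "key" and "value"
        fields, e.g. deserialized Kafka records.
        """
        keys, values = [], []
        for msg in poll_messages:
            keys.append(msg["key"])
            values.append(msg["value"])
            if len(keys) >= BATCH_SIZE:
                yield pa.record_batch(
                    [pa.array(keys, pa.string()), pa.array(values, pa.int64())],
                    schema=SCHEMA)
                keys, values = [], []
        if keys:  # flush whatever is left when the stream ends
            yield pa.record_batch(
                [pa.array(keys, pa.string()), pa.array(values, pa.int64())],
                schema=SCHEMA)

With a high arrival rate, BATCH_SIZE rows accumulate quickly, so the batch-level metadata is a small fraction of each batch; with a low arrival rate you could also flush on a timer to bound latency.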
Best,
Liya Fan

On Sat, Sep 5, 2020 at 3:17 AM Micah Kornfield <emkornfi...@gmail.com> wrote:

> Hi Pedro,
> I think the answer is that it likely depends. The main trade-off in using
> Arrow in a streaming process is the high metadata overhead if you have
> very few rows. There have been prior discussions on the mailing list
> about row-based formats and streaming that might be useful [1][2] in
> expanding on the trade-offs.
>
> For some additional color: Brian Hulette gave a talk [3] a while ago
> about potentially using Arrow within Beam (I believe Flink has a high
> overlap with the Beam API) and some of the challenges. It also looks like
> there was a Flink JIRA (that you might be on?) about using Arrow directly
> in Flink and some of the trade-offs [4].
>
> The questions you posed are a little vague; with more context the
> conversation might be more productive.
>
> -Micah
>
> [1] https://lists.apache.org/thread.html/33a4e1a272e77d4959c851481aa25c6e4aa870db172e4c1bbf2e3a35%40%3Cdev.arrow.apache.org%3E
> [2] https://lists.apache.org/thread.html/27945533db782361143586fd77ca08e15e96e2f2a5250ff084b462d6%40%3Cdev.arrow.apache.org%3E
> [3] https://www.youtube.com/watch?v=avy1ifTZlhE
> [4] https://issues.apache.org/jira/browse/FLINK-10929
>
> On Fri, Sep 4, 2020 at 12:39 AM Pedro Silva <pedro.cl...@gmail.com> wrote:
>
> > Hello,
> >
> > This may be a stupid question, but is Arrow used for, or designed with,
> > streaming processing use cases in mind, where the data is
> > non-stationary, e.g. Flink stream processing jobs?
> >
> > In particular, is it possible, from a given event source (say Kafka),
> > to efficiently generate incremental record batches for stream
> > processing?
> >
> > Suppose there is a data source that continuously generates messages
> > with 100+ fields. You want to compute grouped aggregations (sums,
> > averages, count distinct, etc.) over a select few of those fields, say
> > 5 fields at most across all queries.
> >
> > Is this a valid use case for Arrow?
> > What if time is important and some windowing technique has to be
> > applied?
> >
> > Thank you very much for your time!
> > Have a good day.
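For the grouped-aggregation use case above, a minimal sketch (again assuming pyarrow, 7.0 or later for group_by; the "key"/"value" field names are hypothetical) of how Arrow's columnar layout lets a query touch only the few columns it needs, however many fields each message carries:

    import pyarrow as pa

    def aggregate(batch: pa.RecordBatch) -> pa.Table:
        # Select only the columns the query needs; the buffers of the
        # other 95+ fields are never read thanks to the columnar layout.
        table = pa.Table.from_batches([batch]).select(["key", "value"])
        # Grouped sum and mean over the selected column.
        return table.group_by("key").aggregate([("value", "sum"),
                                                ("value", "mean")])

Note this aggregates one batch at a time; windowing across batches (as in Flink) would still need engine-level state on top of Arrow.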