> > Furthermore, these types of queries seem to fit what I would call (for
> > lack of a better word) "sliding" dataframes. Arrow's aim (as I understand
> > it) is to standardize the memory model of the static dataframe data
> > structure; can it also support a sliding version?
I don't think there are any explicit library features planned around sliding
windows, but the current abstractions allow for lazily combining columns into
a single logical structure and working on that. I imagine sliding window
abstractions could be built from that.

On Thu, Sep 10, 2020 at 2:04 AM Pedro Silva <pedro.cl...@gmail.com> wrote:

> Hi Micah,
>
> Thank you for your reply and the links; the threads were quite interesting.
> You are right, I opened the Flink issue regarding Arrow support to
> understand whether it was on their roadmap.
>
> My use-case is processing a stream of events (or rows, if you will) to
> compute ~100-150 sliding window aggregations over a subset of the received
> fields (say 10 out of a row with 80+ fields).
>
> Something of the sort:
> *average(session_time) group by ID over 1 hour.*
>
> For the query above only 3 fields are required: session_time, ID, and the
> timestamp (implicitly required to define time windows), meaning that we can
> discard a significant amount of information from the original event.
>
> Furthermore, these types of queries seem to fit what I would call (for
> lack of a better word) "sliding" dataframes. Arrow's aim (as I understand
> it) is to standardize the memory model of the static dataframe data
> structure; can it also support a sliding version?
>
> Usually these queries are defined by data scientists and domain experts
> who are comfortable using Python, not Java or C++, which are the
> languages streaming engines are built in.
>
> My goal is to understand whether existing streaming engines like Flink
> can converge on a common model that could in the future help develop
> efficient cross-language streaming engines.
>
> I hope I've been able to clarify some points.
>
> Thanks
>
> On Fri, Sep 4, 2020 at 20:17, Micah Kornfield <emkornfi...@gmail.com>
> wrote:
>
>> Hi Pedro,
>> I think the answer is: it likely depends.
>> The main trade-off in using Arrow in a streaming process is the high
>> metadata overhead if you have very few rows. There have been prior
>> discussions on the mailing list about row-based formats and streaming
>> that might be useful [1][2] in expanding on the trade-offs.
>>
>> For some additional color: Brian Hulette gave a talk [3] a while ago about
>> potentially using Arrow within Beam (I believe Flink has a high overlap
>> with the Beam API) and some of the challenges. It also looks like there
>> was a Flink JIRA (that you might be on?) about using Arrow directly in
>> Flink and some of the trade-offs [4].
>>
>> The questions you posed are a little bit vague; if there is more context,
>> it might help make the conversation more productive.
>>
>> -Micah
>>
>> [1]
>> https://lists.apache.org/thread.html/33a4e1a272e77d4959c851481aa25c6e4aa870db172e4c1bbf2e3a35%40%3Cdev.arrow.apache.org%3E
>> [2]
>> https://lists.apache.org/thread.html/27945533db782361143586fd77ca08e15e96e2f2a5250ff084b462d6%40%3Cdev.arrow.apache.org%3E
>> [3] https://www.youtube.com/watch?v=avy1ifTZlhE
>> [4] https://issues.apache.org/jira/browse/FLINK-10929
>>
>> On Fri, Sep 4, 2020 at 12:39 AM Pedro Silva <pedro.cl...@gmail.com>
>> wrote:
>>
>> > Hello,
>> >
>> > This may be a stupid question, but is Arrow used for, or designed with,
>> > stream-processing use-cases in mind, where data is non-stationary, i.e.
>> > Flink stream-processing jobs?
>> >
>> > In particular, is it possible, from a given event source (say Kafka), to
>> > efficiently generate incremental record batches for stream processing?
>> >
>> > Suppose there is a data source that continuously generates messages with
>> > 100+ fields. You want to compute grouped aggregations (sums, averages,
>> > count distinct, etc.) over a select few of those fields, say 5 fields at
>> > most used across all queries.
>> >
>> > Is this a valid use-case for Arrow?
>> > What if time is important and some windowing technique has to be
>> > applied?
>> >
>> > Thank you very much for your time!
>> > Have a good day.