I'm not an Arrow contributor (perhaps one day!) but as a close follower and user of the project for the last six months (Arrow Flight specifically), I kind of jumped out of my chair when I saw this today. It's *exactly* what my team is looking for and something I have been close to building myself. I suspect our use case is similar to Nate's, seeing that Deephaven operates at least somewhat in the financial space (as does my employer PEAK6), though I absolutely believe it would be valuable in other domains as well.
I will be reading Nate's docs and hopefully contributing or at least giving feedback on whatever comes of it. I'm tempted to share thoughts already but am going to spend some time absorbing first, to reduce the risk of derailing the good conversation already under way.

> As a side note - is said UI browser-based?

For what it's worth, our use case would not be browser-based.

Best,
Paul

On Wed, Mar 3, 2021 at 5:22 PM David Li <lidav...@apache.org> wrote:

> Ah okay, thank you for clarifying! In that case, if each payload has two
> batches with different purposes - might it make sense to just make that two
> different payloads, and set a flag/enum in the metadata to indicate how to
> interpret the batch? Then you'd be officially the same as Arrow Flight :)
>
> As a side note - is said UI browser-based? Another project recently was
> planning to look at JavaScript support for Flight (using WebSockets as the
> transport, IIRC) and it might make sense to join forces if that's a path
> you were also going to pursue.
>
> Best,
> David
>
> On Wed, Mar 3, 2021, at 18:05, Nate Bauernfeind wrote:
> > Thanks for the interest =).
> >
> > > However, if I understand right, you're sending data without a fixed
> > > schema [...]
> >
> > The dataset does have a known schema ahead of time, which is similar to
> > Flight. However, as you point out, the subscription can change which
> > columns it is interested in without re-acquiring data for columns it was
> > already subscribed to. This is mostly for convenience. We use it primarily
> > to limit which columns are sent to our user interface until the user
> > scrolls them into view.
> >
> > The enhancement of the RecordBatch here, aside from the additional
> > metadata, is only in that the payload has two sets of RecordBatch payloads.
> > The first payload is for added rows, every added row must send data for
> > each column subscribed; based on the subscribed columns this is otherwise
> > fixed width (in the number of columns / buffers).
> > The second payload is for modified rows. Here we only send the columns
> > that have rows that are modified. Aside from this difference, I have been
> > aiming to be compatible enough to be able to reuse the payload parsing
> > that is already written for Arrow.
> >
> > > I don't quite see why it couldn't be carried as metadata on the side of a
> > > record batch, instead of having to duplicate the record batch structure
> > > [...]
> >
> > Whoa, this is a good point. I have iterated on this a few times to get it
> > closer to Arrow's setup and did not realize that 'BarrageData' is now
> > officially identical to `FlightData`. This is an instance of being too
> > close to the project and forgetting to step back once in a while.
> >
> > > Flight already has a bidirectional streaming endpoint, DoExchange, that
> > > allows arbitrary payloads (with mixed metadata/data or only one of the
> > > two), which seems like it should be able to cover the SubscriptionRequest
> > > endpoint.
> >
> > This is exactly the kind of feedback I'm looking for! I wasn't seeing the
> > solution where the client-side stream doesn't actually need payload and
> > that the subscription changes can be described with another flatbuffer
> > metadata type. I like that.
> >
> > Thanks David!
> > Nate
> >
> > On Wed, Mar 3, 2021 at 3:28 PM David Li <lidav...@apache.org> wrote:
> >
> > > Hey Nate,
> > >
> > > Thanks for sharing this & for the detailed docs and writeup. I think your
> > > use case is interesting, but I'd like to clarify a few things.
> > >
> > > I would say Arrow Flight doesn't try to impose a particular model, but I
> > > agree that Barrage does things that aren't easily doable with Flight.
> > > Flight does name concepts in a way that suggests how to apply it to
> > > something that looks like a database, but you can mostly think of Flight as
> > > an efficient way to transfer Arrow data over the network upon which you can
> > > layer further semantics.
> > >
> > > However, if I understand right, you're sending data without a fixed
> > > schema, in the sense that each BarrageRecordBatch may have only a subset of
> > > the columns declared up front, or may carry new columns? I think this is
> > > the main thing you can't easily do currently, as Flight (and Arrow IPC in
> > > general) assumes a fixed schema (and expects all columns in a batch to have
> > > the same length).
> > >
> > > Otherwise, the encoding for identifying rows and changes is interesting,
> > > but I don't quite see why it couldn't be carried as metadata on the side of
> > > a record batch, instead of having to duplicate the record batch structure,
> > > except for the aforementioned schema issue. And in that case it might be
> > > better to work out the schema evolution issue & any ergonomic issues with
> > > Flight's existing metadata fields/API that would prevent you from using
> > > them, as that way you (and we!) don't have to fully duplicate one of
> > > Arrow's format definitions. Similarly, Flight already has a bidirectional
> > > streaming endpoint, DoExchange, that allows arbitrary payloads (with mixed
> > > metadata/data or only one of the two), which seems like it should be able
> > > to cover the SubscriptionRequest endpoint.
> > >
> > > Best,
> > > David
> > >
> > > On Wed, Mar 3, 2021, at 16:08, Nate Bauernfeind wrote:
> > > > Hello,
> > > >
> > > > My colleagues at Deephaven Data Labs and I have been addressing problems at
> > > > the intersection of data-driven applications, data science, and updating
> > > > (/ticking) data for some years.
> > > >
> > > > Deephaven has a query engine that supports updating tabular data via a
> > > > protocol that communicates precise changes about datasets, such as 1) which
> > > > rows were removed, 2) which rows were added, 3) which rows were modified
> > > > (and for which columns).
> > > > We are inspired by Arrow and would like to adopt a version of this
> > > > protocol that adheres to goals similar to Arrow and Arrow Flight.
> > > >
> > > > Out of the box, Arrow Flight is insufficient to represent such a stream of
> > > > changes. For example, because you cannot identify a particular row within
> > > > an Arrow Flight, you cannot indicate which rows were removed or modified.
> > > >
> > > > The project integrates with Arrow Flight at the header-metadata level. We
> > > > have preliminarily named the project Barrage as in a "barrage of arrows"
> > > > which plays in the same "namespace" as a "flight of arrows."
> > > >
> > > > We built this as part of an initiative to modernize and open up our table
> > > > IPC mechanisms. This is part of a larger open source effort which will
> > > > become more visible in the next month or so once we've finished the work
> > > > necessary to share our core software components, including a unified static
> > > > and real time query engine complete with data visualization tools, a REPL
> > > > experience, Jupyter integration, and more.
> > > >
> > > > I would like to find out:
> > > > - if we have understood the primary goals of Arrow, and are honoring them
> > > >   as closely as possible
> > > > - if there are other projects that might benefit from sharing this
> > > >   extension of Arrow Flight
> > > > - if there are any gaps that are best addressed early on to maximize future
> > > >   compatibility
> > > >
> > > > A great place to digest the concepts that differ from Arrow Flight are here:
> > > > https://deephaven.github.io/barrage/Concepts.html
> > > >
> > > > The proposed protocol can be perused here:
> > > > https://github.com/deephaven/barrage
> > > >
> > > > Internally, we already have a java server and java client implemented as a
> > > > working proof of concept for our use case.
> > > >
> > > > I really look forward to your feedback; thank you!
> > > >
> > > > Nate Bauernfeind
> > > >
> > > > Deephaven Data Labs - https://deephaven.io/