Hey Nate,

Thanks for sharing this, and for the detailed docs and writeup. I think your use
case is interesting, but I'd like to clarify a few things.

I would say Arrow Flight doesn't try to impose a particular model, but I agree 
that Barrage does things that aren't easily doable with Flight. Flight does 
name concepts in a way that suggests how to apply it to something that looks 
like a database, but you can mostly think of Flight as an efficient way to
transfer Arrow data over the network, on top of which you can layer further
semantics.

However, if I understand correctly, you're sending data without a fixed schema,
in the sense that each BarrageRecordBatch may have only a subset of the columns
declared up front, or may carry new columns? I think this is the main thing you
can't easily do currently, as Flight (and Arrow IPC in general) assumes a fixed
schema (and expects all columns in a batch to have the same length).
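
To make that constraint concrete, here is a minimal plain-Python sketch. It is
only an illustration of the rule, not Arrow or Flight code; check_batch and the
list/dict representation of a schema and a batch are hypothetical stand-ins.

```python
# Illustrative only: a plain-Python stand-in for the fixed-schema rule,
# not the Arrow IPC implementation.

def check_batch(schema, batch):
    """Return a list of problems; an empty list means the batch fits.

    schema: list of column names declared up front (hypothetical form)
    batch:  dict mapping column name -> list of values
    """
    problems = []
    missing = [name for name in schema if name not in batch]
    extra = [name for name in batch if name not in schema]
    if missing:
        problems.append(f"missing columns: {missing}")
    if extra:
        problems.append(f"undeclared columns: {extra}")
    # Every column in one batch must describe the same number of rows.
    lengths = {len(values) for values in batch.values()}
    if len(lengths) > 1:
        problems.append(f"columns differ in length: {sorted(lengths)}")
    return problems


schema = ["sym", "price", "size"]
full = {"sym": ["A", "B"], "price": [1.0, 2.0], "size": [10, 20]}
subset = {"sym": ["A"], "price": [1.5]}  # only some columns sent
ragged = {"sym": ["A", "B"], "price": [1.0], "size": [10, 20]}
```

Here full passes, while subset (a batch carrying only some of the declared
columns) and ragged (columns of different lengths) are both rejected.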

Otherwise, the encoding for identifying rows and changes is interesting, but
apart from the aforementioned schema issue, I don't quite see why it couldn't be
carried as metadata alongside a record batch instead of duplicating the record
batch structure. In that case it might be better to work out the schema
evolution issue, plus any ergonomic issues with Flight's existing metadata
fields/API that would prevent you from using them; that way you (and we!) don't
have to fully duplicate one of Arrow's format definitions. Similarly, Flight
already has a bidirectional streaming endpoint, DoExchange, that allows
arbitrary payloads (with mixed metadata/data or only one of the two), which
seems like it should be able to cover the SubscriptionRequest endpoint.
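
As a rough sketch of what I mean by metadata on the side (plain Python, not
Flight code): the row changes travel as an opaque byte string next to an
ordinary rectangular batch. The JSON layout, encode_update, and apply_update
are all hypothetical; in a real stream the bytes would go in Flight's
app_metadata field. I'm also assuming modified rows are resent in full, which
keeps every column the same length but sidesteps your per-column
modifications.

```python
import json


def encode_update(removed, added, modified):
    """Serialize the row-change description to bytes (metadata stand-in)."""
    payload = {"removed": removed, "added": added, "modified": modified}
    return json.dumps(payload).encode("utf-8")


def apply_update(table, metadata, batch):
    """Apply one (metadata, batch) message to a table.

    table: dict mapping row key -> {column: value}
    batch: dict mapping column -> list of values; its rows are the added
           rows followed by the modified rows, in metadata order.
    """
    update = json.loads(metadata.decode("utf-8"))
    for key in update["removed"]:
        table.pop(key, None)
    keys = update["added"] + update["modified"]
    for i, key in enumerate(keys):
        # Each batch row fully replaces (or creates) the row for its key.
        table[key] = {name: values[i] for name, values in batch.items()}
    return table


table = {
    0: {"sym": "A", "price": 1.0},
    1: {"sym": "B", "price": 2.0},
    2: {"sym": "C", "price": 3.0},
}
metadata = encode_update(removed=[1], added=[3], modified=[0])
batch = {"sym": ["D", "A"], "price": [4.0, 1.5]}
result = apply_update(table, metadata, batch)
```

After this message, row 1 is gone, row 3 has been added, and row 0 carries the
new price, all without changing the record batch format itself.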

Best,
David

On Wed, Mar 3, 2021, at 16:08, Nate Bauernfeind wrote:
> Hello,
> 
> My colleagues at Deephaven Data Labs and I have been addressing problems at
> the intersection of data-driven applications, data science, and updating
> (/ticking) data for some years.
> 
> Deephaven has a query engine that supports updating tabular data via a
> protocol that communicates precise changes about datasets, such as 1) which
> rows were removed, 2) which rows were added, 3) which rows were modified
> (and for which columns). We are inspired by Arrow and would like to adopt a
> version of this protocol that adheres to goals similar to Arrow and Arrow
> Flight.
> 
> Out of the box, Arrow Flight is insufficient to represent such a stream of
> changes. For example, because you cannot identify a particular row within
> an Arrow Flight, you cannot indicate which rows were removed or modified.
> 
> The project integrates with Arrow Flight at the header-metadata level. We
> have preliminarily named the project Barrage as in a "barrage of arrows"
> which plays in the same "namespace" as a "flight of arrows."
> 
> We built this as part of an initiative to modernize and open up our table
> IPC mechanisms. This is part of a larger open source effort which will
> become more visible in the next month or so once we've finished the work
> necessary to share our core software components, including a unified static
> and real time query engine complete with data visualization tools, a REPL
> experience, Jupyter integration, and more.
> 
> I would like to find out:
> - if we have understood the primary goals of Arrow, and are honoring them
> as closely as possible
> - if there are other projects that might benefit from sharing this
> extension of Arrow Flight
> - if there are any gaps that are best addressed early on to maximize future
> compatibility
> 
> A great place to digest the concepts that differ from Arrow Flight is here:
> https://deephaven.github.io/barrage/Concepts.html
> 
> The proposed protocol can be perused here:
> https://github.com/deephaven/barrage
> 
> Internally, we already have a Java server and Java client implemented as a
> working proof of concept for our use case.
> 
> I really look forward to your feedback; thank you!
> 
> Nate Bauernfeind
> 
> Deephaven Data Labs - https://deephaven.io/
> --
> 