Re: Remote datasets

2022-04-12 Thread Weston Pace
I'll add my perspective (which hopefully doesn't confuse things more). I think the fragment concept is a little too specific and the key abstraction here is "stream of homogeneous record batches". This manifests in a few different flavors (synchronous, asynchronous, push/pull) but we have some gene

Re: Remote datasets

2022-04-12 Thread Adam Lippai
Hi David, This is a perfect answer. I was looking for the Fragment concept and the issues you linked make it easy to follow. I understand this is a really hard field with a ton of work: getting chunking, prefetch and backpressure right, plus adding filter predicate and other computation pushdown i

Re: Remote datasets

2022-04-12 Thread David Li
TL;DR yes, if and when all is said and done. Breaking this down… Substrait isn't really relevant here. It's a way to serialize a query in a way that's agnostic to whatever's actually generating or executing the query. But if you have a Substrait plan, that can get converted by the Arrow C++ Que

Re: Remote datasets

2022-04-12 Thread Adam Lippai
Hi James, Your answer helps, yes. My question is whether I will be able to join two datasets (producing a new dataset) in a streaming way, or do I have to fetch the whole response and keep it in memory? So if my local node has memory constraints, will it be able to stream data from an Arrow Flight

Re: Remote datasets

2022-04-12 Thread David Li
Hey Adam, Good question, there are outstanding JIRAs to integrate Flight [1] and HTTP/FTP [2] into Datasets/Filesystems. There are also some JIRAs about various RDBMSes [3] that could perhaps also be viewed through a Datasets lens. Note that this work all proceeds in layers, e.g. it's the C++ qu

Re: Remote datasets

2022-04-12 Thread James Duong
Hi Adam, Arrow Flight can be used to provide an RPC framework that returns datasets (sent over the wire as arrow buffers) and exposes them from a FlightClient as Arrow RecordBatches without serialization. Is this what you mean by remote datasets? Arrow Flight SQL is an application layer built on t