Flight as distributed persistence service

Gary Pennington Mon, 17 May 2021 08:23:08 -0700

Hi David,

Thanks for the feedback. I’m reassured that you don’t think the idea is too 
crazy. 😊

I’ll take a look at the FlightSQL proposal you mention. A related project to 
the one I’m working on will need a more structured approach to data storage. 
Probably graph-like rather than SQL-like, but the ideas are still likely to be 
applicable.

Regarding schema evolution – I am not talking about evolution during a call, 
but rather over time between gets/puts. I can think of ways to manage that, 
but I wondered whether any best practices have started to emerge in this space.
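
As a rough illustration of one way to manage evolution between puts, here is a minimal sketch of an in-memory schema registry that only accepts additive changes. Everything in it (`Column`, `SchemaRegistry`, `pad_row`, and the compatibility rule itself) is a hypothetical convention chosen for the example, not something established in this thread or provided by Arrow or Flight.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Column:
    # Hypothetical, simplified stand-in for an Arrow field.
    name: str
    dtype: str
    nullable: bool = True


def is_backward_compatible(old, new):
    """Return True if data written under `old` can be read under `new`.

    Assumed rule (illustrative only): every existing column keeps its name
    and type, a nullable column stays nullable, and any added column must be
    nullable so that older rows can be padded with nulls.
    """
    new_by_name = {c.name: c for c in new}
    for col in old:
        match = new_by_name.get(col.name)
        if match is None or match.dtype != col.dtype:
            return False
        if col.nullable and not match.nullable:
            return False
    old_names = {c.name for c in old}
    return all(c.nullable for c in new if c.name not in old_names)


class SchemaRegistry:
    """Keeps an ordered list of schema versions per dataset."""

    def __init__(self):
        self._versions = {}

    def register(self, dataset, schema):
        versions = self._versions.setdefault(dataset, [])
        if versions and not is_backward_compatible(versions[-1], schema):
            raise ValueError(f"incompatible schema change for {dataset!r}")
        versions.append(list(schema))
        return len(versions)  # 1-based version number

    def latest(self, dataset):
        return self._versions[dataset][-1]


def pad_row(row, schema):
    """Read an older row under a newer schema, filling absent columns with None."""
    return {c.name: row.get(c.name) for c in schema}
```

Under this rule a put carrying a new nullable column would be registered as the next version, while a put that changes the type of an existing column would be rejected before anything is written.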

Cheers,

Gary

From: David Li <lidav...@apache.org>
Date: Monday, 17 May 2021 at 15:25
To: dev@arrow.apache.org <dev@arrow.apache.org>
Subject: Re: [DISCUSS] Parquet/Arrow/Flight as distributed persistence service

Hey Gary,

Sounds like an interesting project!

To speak a bit to the Flight question: I don't think you need a new
action; using DoGet/DoPut as you describe makes sense for
persistence. There are no required semantics for Flight - it certainly
suggests certain patterns (GetFlightInfo -> DoGet for instance) but
none of that is formally specified/required, nor is there a generic
client that expects to be able to talk to any Flight server.

And indeed, you can search the archives of this list for the FlightSQL
proposal, which is somewhat similar to your project in spirit (but
oriented towards traditional relational databases).

As for schema evolution - I think you are not talking about schema
evolution during a single Flight RPC call (not (yet) supported), but
rather evolving the schema of a stored dataset between reads? (Just to
clarify whether this is a question about Flight or not.)

Best,
David

On 2021/05/17 13:21:09, Gary Pennington <gary.penning...@anaplan.com.INVALID> 
wrote:
> Hi,
>
> (NB: I first floated this question in the arrow-rust slack channel and Jorge 
> Leitao suggested I should ask here.)
>
> I’m cranking up a project to provide functionality based on 
> parquet/arrow/flight, implemented in Rust. The primary goal of the project 
> is to provide a mechanism for storing/retrieving large quantities of 
> column-oriented data across different types of storage mechanism (S3, 
> filesystem, etc.). Initially, at least, the flight/arrow/parquet stack looks 
> to be a great fit for what I’m doing.
>
> I’ve done some prototyping and so far I’ve made good progress. I have a 
> simple flight service (written in Rust: arrow 4.0.0 stack) which is happy to 
> send/receive data to/from a very simple flight client (written in Python).
>
> I’ve encountered a few rough edges and before proceeding further I thought 
> I’d see what other people think of the idea of using flight/arrow to provide 
> a persistence service (parquet) for large quantities of column oriented data.
>
> One of my questions is about the use of flight. Flight seems to be primarily 
> oriented around streams of data (which is cool), but has anyone else 
> considered using that as the basis for a distributed storage framework? 
> do_get would read_parquet/send_arrow parquet data and do_put would 
> receive_arrow/write_parquet it. Or perhaps separate persistence as a new 
> action?
>
> Another question is around schema evolution. Any gotchas with this approach? 
> Do I need to think about a separate schema registry and how would I evolve 
> data against that registry?
>
> For now, forget about authn/authz issues, I think the handshake mechanism 
> will probably suffice, but if not I can roll extensions using the action 
> mechanism.
>
> Has anyone else done anything like this? Does it seem like a reasonable use 
> of the tooling? Any gotchas I should be worrying about?
>
> Cheers,
>
> Gary
>
>
