Hi David,

Thanks for the feedback. I’m reassured that you don’t think the idea is too crazy. 😊
I’ll take a look at the FlightSQL proposal you mention. There is actually a project related to the one I’m working on which will need a more structured approach to data storage. Maybe not SQL-like, more graph-like I think, but the ideas are still likely to be applicable.

Regarding schema evolution: I am not talking about evolution during a call, but rather over time between gets/puts. I can think of ways to manage that over time, but I wondered if any best practices have started to emerge in this space.

Cheers,

Gary

From: David Li <lidav...@apache.org>
Date: Monday, 17 May 2021 at 15:25
To: dev@arrow.apache.org <dev@arrow.apache.org>
Subject: Re: [DISCUSS] Parquet/Arrow/Flight as distributed persistence service

Hey Gary,

Sounds like an interesting project!

To speak a bit to the Flight question: I don't think you need a new action; using DoGet/DoPut as you describe makes sense for persistence. There are no required semantics for Flight - it certainly suggests certain patterns (GetFlightInfo -> DoGet, for instance) but none of that is formally specified/required, nor is there a generic client that expects to be able to talk to any Flight server. And indeed, you can search the archives of this list for the FlightSQL proposal, which is somewhat similar to your project in spirit (but oriented towards traditional relational databases).

As for schema evolution - I think you are not talking about schema evolution during a single Flight RPC call (not (yet) supported), but rather evolving the schema of a stored dataset between reads? (Just to clarify whether this is a question about Flight or not.)

Best,
David

On 2021/05/17 13:21:09, Gary Pennington <gary.penning...@anaplan.com.INVALID> wrote:
> Hi,
>
> (NB: I first floated this question in the arrow-rust slack channel and Jorge
> Leitao suggested I should ask here.)
>
> I’m cranking up a project to provide functionality based on
> parquet/arrow/flight implemented in Rust. The primary goals of the project
> are to provide a mechanism for storing/retrieving large quantities of
> column-oriented data across different types of storage mechanism (S3,
> filesystem, etc.). Initially, at least, the flight/arrow/parquet stack looks
> to be a great fit for what I’m doing.
>
> I’ve done some prototyping and so far I’ve made good progress. I have a
> simple flight service (written in Rust: arrow 4.0.0 stack) which is happy to
> send/receive data to/from a very simple flight client (written in Python).
>
> I’ve encountered a few rough edges and before proceeding further I thought
> I’d see what other people think of the idea of using flight/arrow to provide
> a persistence service (parquet) for large quantities of column-oriented data.
>
> One of my questions is about the use of flight. Flight seems to be primarily
> oriented around streams of data (which is cool), but has anyone else
> considered using that as the basis for a distributed storage framework?
> do_get would read_parquet/send_arrow parquet data and do_put would
> receive_arrow/write_parquet it. Or perhaps separate persistence as a new
> action?
>
> Another question is around schema evolution. Any gotchas with this approach?
> Do I need to think about a separate schema registry, and how would I evolve
> data against that registry?
>
> For now, forget about authn/authz issues; I think the handshake mechanism
> will probably suffice, but if not I can roll extensions using the action
> mechanism.
>
> Has anyone else done anything like this? Does it seem like a reasonable use
> of the tooling? Any gotchas I should be worrying about?
>
> Cheers,
>
> Gary
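
On the schema registry question raised in the thread (evolving a stored dataset's schema between puts and gets, not within a single call), the following is a rough sketch of one possible rule, not an established best practice and not part of Arrow or Flight. It assumes an additive-only policy: a new put may add columns but may not drop or retype existing ones. Everything named here (`SchemaRegistry`, `register`, `conform`, `IncompatibleSchema`, and the dataset name "example") is hypothetical, and schemas are modelled as plain dicts of column name to type name rather than real Arrow schemas, so the sketch stays dependency-free.

```python
from dataclasses import dataclass, field
from typing import Dict, List

# Hypothetical stand-in for an Arrow schema: column name -> type name.
Schema = Dict[str, str]


class IncompatibleSchema(Exception):
    """Raised when an incoming schema would break readers of stored data."""


@dataclass
class SchemaRegistry:
    # dataset name -> ordered list of accepted schemas (index 0 is version 1)
    versions: Dict[str, List[Schema]] = field(default_factory=dict)

    def latest(self, dataset: str) -> Schema:
        """Return the most recently accepted schema for a dataset."""
        return self.versions[dataset][-1]

    def register(self, dataset: str, incoming: Schema) -> int:
        """Check `incoming` against the latest schema; return its version number."""
        history = self.versions.setdefault(dataset, [])
        if not history:
            history.append(dict(incoming))
            return 1
        current = history[-1]
        if incoming == current:
            return len(history)  # unchanged schema, no new version
        # Additive-only rule (an assumption of this sketch): every existing
        # column must still be present with the same type.
        for name, type_name in current.items():
            if name not in incoming:
                raise IncompatibleSchema(f"column {name!r} was dropped")
            if incoming[name] != type_name:
                raise IncompatibleSchema(
                    f"column {name!r} changed from {type_name} to {incoming[name]}"
                )
        history.append(dict(incoming))
        return len(history)


def conform(row: dict, target: Schema) -> dict:
    """Project a row written under an older schema onto `target`,
    filling columns the row does not have with None."""
    return {name: row.get(name) for name in target}


if __name__ == "__main__":
    registry = SchemaRegistry()
    registry.register("example", {"id": "int64", "price": "float64"})
    registry.register("example", {"id": "int64", "price": "float64", "venue": "utf8"})
    print(conform({"id": 1, "price": 2.5}, registry.latest("example")))
```

In a service shaped like the one described above, a put handler could call something like `register` before writing, and a get handler could apply something like `conform` when reading files written under an older version. Whether additive-only is the right policy depends on the data and its readers.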