Flight as distributed persistence service

Gary Pennington Mon, 17 May 2021 06:21:48 -0700

Hi,

(NB: I first floated this question in the arrow-rust slack channel and Jorge 
Leitao suggested I should ask here.)


I’m cranking up a project based on the parquet/arrow/flight stack, implemented 
in rust. The primary goal of the project is to provide a mechanism for 
storing/retrieving large quantities of column-oriented data across different 
types of storage (S3, filesystem, etc.). Initially, at least, the 
flight/arrow/parquet stack looks to be a great fit for what I’m doing.

I’ve done some prototyping and so far I’ve made good progress. I have a simple 
flight service (written in rust: arrow 4.0.0 stack) which is happy to 
send/receive data to/from a very simple flight client (written in python).

I’ve encountered a few rough edges, and before proceeding further I thought I’d 
see what other people think of the idea of using flight/arrow to provide a 
persistence service (parquet) for large quantities of column-oriented data.

One of my questions is about the use of flight. Flight seems to be primarily 
oriented around streams of data (which is cool), but has anyone else considered 
using it as the basis for a distributed storage framework? do_get would read 
parquet data and send it as arrow (read_parquet/send_arrow), and do_put would 
receive arrow data and write it as parquet (receive_arrow/write_parquet). Or 
perhaps persistence should be separated out as a new action?
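
A minimal sketch of that do_get/do_put mapping, using only the rust standard 
library. Everything named here (Backend, MemoryBackend, FsBackend, 
PersistenceService) is hypothetical and is not part of the arrow or flight 
crates. Payloads are opaque bytes standing in for encoded parquet, and the two 
methods only borrow the flight method names to show where the read/write steps 
would sit.

```rust
use std::collections::HashMap;
use std::fs;
use std::io;
use std::path::PathBuf;

/// Hypothetical storage abstraction: each backend keeps opaque bytes
/// (encoded parquet in a real service) under a string key.
pub trait Backend {
    fn put(&mut self, key: &str, bytes: &[u8]) -> io::Result<()>;
    fn get(&self, key: &str) -> io::Result<Vec<u8>>;
}

/// In-memory backend, useful for tests.
#[derive(Default)]
pub struct MemoryBackend {
    objects: HashMap<String, Vec<u8>>,
}

impl Backend for MemoryBackend {
    fn put(&mut self, key: &str, bytes: &[u8]) -> io::Result<()> {
        self.objects.insert(key.to_string(), bytes.to_vec());
        Ok(())
    }

    fn get(&self, key: &str) -> io::Result<Vec<u8>> {
        self.objects
            .get(key)
            .cloned()
            .ok_or_else(|| io::Error::new(io::ErrorKind::NotFound, key.to_string()))
    }
}

/// Filesystem backend: one file per key directly under `root`.
pub struct FsBackend {
    pub root: PathBuf,
}

impl Backend for FsBackend {
    fn put(&mut self, key: &str, bytes: &[u8]) -> io::Result<()> {
        fs::create_dir_all(&self.root)?;
        fs::write(self.root.join(key), bytes)
    }

    fn get(&self, key: &str) -> io::Result<Vec<u8>> {
        fs::read(self.root.join(key))
    }
}

/// Stand-in for the flight service; generic over the storage backend.
pub struct PersistenceService<B: Backend> {
    backend: B,
}

impl<B: Backend> PersistenceService<B> {
    pub fn new(backend: B) -> Self {
        Self { backend }
    }

    /// Where receive_arrow/write_parquet would happen in a real service.
    pub fn do_put(&mut self, path: &str, payload: &[u8]) -> io::Result<()> {
        self.backend.put(path, payload)
    }

    /// Where read_parquet/send_arrow would happen in a real service.
    pub fn do_get(&self, ticket: &str) -> io::Result<Vec<u8>> {
        self.backend.get(ticket)
    }
}
```

Keeping the service generic over a backend trait means an S3 implementation 
could be added later without touching the do_get/do_put code paths.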

Another question is around schema evolution. Are there any gotchas with this 
approach? Do I need to think about a separate schema registry, and if so, how 
would I evolve data against that registry?
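
One possible shape for such a registry, sketched with only the rust standard 
library. The types and the compatibility rule are assumptions made for 
illustration: Field uses a plain string where a real service would use an arrow 
data type, and a new schema version is accepted only if it keeps every existing 
field with the same type and adds nothing but nullable fields. Other evolution 
policies are equally plausible.

```rust
use std::collections::{HashMap, HashSet};

/// Hypothetical field description; `data_type` is a plain string
/// (e.g. "int64") standing in for a real arrow data type.
#[derive(Clone, Debug, PartialEq)]
pub struct Field {
    pub name: String,
    pub data_type: String,
    pub nullable: bool,
}

/// Convenience constructor for the sketch.
pub fn field(name: &str, data_type: &str, nullable: bool) -> Field {
    Field {
        name: name.to_string(),
        data_type: data_type.to_string(),
        nullable,
    }
}

/// True when `new` keeps every field of `old` with the same type and
/// every added field is nullable. Nullability changes on existing
/// fields are not checked in this sketch.
pub fn is_additive_change(old: &[Field], new: &[Field]) -> bool {
    let new_by_name: HashMap<&str, &Field> =
        new.iter().map(|f| (f.name.as_str(), f)).collect();
    let old_names: HashSet<&str> = old.iter().map(|f| f.name.as_str()).collect();

    let old_preserved = old.iter().all(|o| {
        new_by_name
            .get(o.name.as_str())
            .map_or(false, |n| n.data_type == o.data_type)
    });
    let additions_nullable = new
        .iter()
        .filter(|n| !old_names.contains(n.name.as_str()))
        .all(|n| n.nullable);

    old_preserved && additions_nullable
}

/// Hypothetical registry: dataset name -> ordered list of schema versions.
#[derive(Default)]
pub struct SchemaRegistry {
    versions: HashMap<String, Vec<Vec<Field>>>,
}

impl SchemaRegistry {
    /// Registers a schema and returns its 1-based version number, or an
    /// error if it is not an additive change over the latest version.
    pub fn register(&mut self, dataset: &str, schema: Vec<Field>) -> Result<usize, String> {
        let history = self.versions.entry(dataset.to_string()).or_default();
        if let Some(latest) = history.last() {
            if !is_additive_change(latest, &schema) {
                return Err(format!("incompatible schema for {}", dataset));
            }
        }
        history.push(schema);
        Ok(history.len())
    }
}
```

Under this rule, files written with an older version stay readable by filling 
the added columns with nulls, while a changed column type is rejected at 
registration time.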

For now, forget about authn/authz issues; I think the handshake mechanism will 
probably suffice, but if not I can roll extensions using the action mechanism.

Has anyone else done anything like this? Does it seem like a reasonable use of 
the tooling? Any gotchas I should be worrying about?

Cheers,

Gary

[DISCUSS] Parquet/Arrow/Flight as distributed persistence service
