Flight Python EC2 Server for parquet on S3

Christian Casazza Fri, 10 May 2024 08:30:40 -0700

Hello everyone,

This is my first time emailing this mailing list, so I hope I am explaining
things correctly below.


I am attempting to get started with Arrow Flight. I am storing parquet
files and Iceberg tables on S3. I would like to use Arrow Flight as the
interface data consumers use to access my data, so they always receive Arrow
back and can then continue to iterate locally with DuckDB, Polars,
etc.

I am first attempting to get it working with a single parquet file in a
private bucket on S3. For this test, I am putting the credentials and
paths directly in the server code; once it works, I will move them to
environment variables before production.
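
A minimal sketch of that later move to environment variables, using only
the standard library. AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY are the
usual AWS variable names; FLIGHT_DATASET_PATH and load_s3_settings are
hypothetical names chosen for illustration and are not in the linked
repository.

```python
import os


def load_s3_settings(env=os.environ):
    """Collect S3 settings from the environment instead of the source code."""
    # FLIGHT_DATASET_PATH is a hypothetical variable name for "bucket/key.parquet".
    required = ["AWS_ACCESS_KEY_ID", "AWS_SECRET_ACCESS_KEY", "FLIGHT_DATASET_PATH"]
    missing = [name for name in required if not env.get(name)]
    if missing:
        # Fail at startup rather than on the first client request.
        raise RuntimeError("missing environment variables: " + ", ".join(missing))
    return {
        "access_key": env["AWS_ACCESS_KEY_ID"],
        "secret_key": env["AWS_SECRET_ACCESS_KEY"],
        "dataset_path": env["FLIGHT_DATASET_PATH"],
    }
```

The server would call load_s3_settings() once at startup and pass the
values to whichever S3 filesystem object it constructs.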

The parquet file is about 0.6 GB. I am running the server on a t2.micro
EC2 instance.

I was originally running into an ACCESS_DENIED AWS error during the
HeadObject operation when attempting to get the flight_info metadata about
the file. Following this issue <https://github.com/apache/arrow/issues/37888>,
I switched to s3fs and was able to avoid the HeadObject error. The client
can now list the available datasets and retrieve the schema.

When I attempt to actually download the data itself, my EC2 instance
becomes unresponsive and my SSH connection drops. Is this likely a
memory issue, or something in my code?


The goal is to provide users with a common interface to access my data.
After getting this working, I would add more datasets and data sources,
introduce auth and RBAC, etc. I thought this was a good starting point.
For now, the user simply downloads the entire dataset. In the future, I
hope to find an easy interface that supports more fine-grained table
scans, or that accepts a query first and returns only the desired data.
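
One possible shape for such an interface, sketched under the assumption
that the request is encoded as JSON inside the Flight ticket (a ticket is
opaque bytes, so the encoding is up to the server). encode_ticket,
decode_ticket, and the "dataset" and "columns" keys are hypothetical
names for illustration.

```python
import json


def encode_ticket(dataset, columns=None):
    """Client side: describe the dataset and an optional column projection."""
    payload = {"dataset": dataset, "columns": columns}
    return json.dumps(payload).encode("utf-8")


def decode_ticket(raw):
    """Server side: recover (dataset, columns) from the ticket bytes."""
    payload = json.loads(raw.decode("utf-8"))
    dataset = payload["dataset"]
    columns = payload.get("columns")  # None means "all columns"
    if columns is not None and not all(isinstance(c, str) for c in columns):
        raise ValueError("columns must be a list of strings")
    return dataset, columns
```

The client would wrap the bytes in pyarrow.flight.Ticket, and the server
would read them from ticket.ticket in do_get. The decoded column list
could then be passed to the parquet reader so that only those columns are
read; the schema advertised for the stream would need to match that
projection.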

To keep things simple, I put my code here:
<https://github.com/ChristianCasazza/arrowflights3example>.
When testing, I connected to the EC2 instance through VS Code to run the
server, and ran the client code locally in a different window. I removed
my actual parquet file path and credentials from the repository.


This is my first time working with Arrow Flight, so I apologize if I am
overlooking something simple or if the answer was in the docs.

Any suggestions for changes I can make to get the data download working, or
clear errors I am making?

Thank you!

Best,
Christian Casazza
