Re: Flight Python EC2 Server for parquet on S3

Bryce Mecum Fri, 10 May 2024 11:11:51 -0700

Hi Christian, welcome.

Your code looks reasonable to me at first glance. It does seem
possible you're resource-constrained with that t2.micro instance. You
might try using a larger instance or reducing the batch size in your
call to iter_batches [1] to some very small number.

[1] https://arrow.apache.org/docs/python/generated/pyarrow.parquet.ParquetFile.html#pyarrow.parquet.ParquetFile.iter_batches

On Fri, May 10, 2024 at 7:30 AM Christian Casazza
<christiancasazz...@gmail.com> wrote:
>
> Hello everyone,
>
> This is my first time emailing this mailing list, so I hope I am explaining
> things correctly below.
>
> I am attempting to get started with Arrow Flight. I am storing parquet
> files and Iceberg tables on S3. I would like to use arrow flight as the
> interface data consumers use to access my data so they always receive Arrow
> back, where they can then continue to iterate locally with DuckDB, polars,
> etc.
>
> I am first attempting to get it working with a single parquet file in a
> private bucket on S3. For this test, I am putting the credentials and
> paths directly in the server code; once it works, I can move them to
> environment variables before production.
>
> The parquet file is about 0.6GB. I am running the server on an EC2
> t2.micro instance.
>
> I was originally running into an ACCESS_DENIED AWS error during the
> HeadObject operation when attempting to get the flight_info metadata about
> the file. Following this issue <https://github.com/apache/arrow/issues/37888>,
> I switched to s3fs and was able to avoid the HeadObject error. The client is
> now able to see the available datasets and retrieve the schema.
>
> When I attempt to actually download the data itself, my EC2 instance
> becomes unresponsive and my SSH connection drops. Is this likely a
> memory issue, or something with my code?
>
>
> The goal is to provide users with a common interface to access my data.
> After getting this working, I would add more datasets and data sources,
> introduce auth and RBAC, etc. I thought this was a good starting point.
> For now, the user simply downloads the entire dataset. In the future, I
> hope to figure out an easy interface to support more fine-grained
> data/table scans, or to support running a query first and returning only
> the desired data.
>
> To keep things simple, I put my code here:
> https://github.com/ChristianCasazza/arrowflights3example
> When I was actually testing, I connected to the EC2 instance through VScode
> for the server, and I was running the client code locally in a different
> window. I removed my actual parquet file path and credentials.
>
>
> This is my first time working with Arrow Flight, so I apologize if I am
> overlooking something simple or if the answer was in the docs.
>
> Any suggestions for changes I can make to get the data download working, or
> clear errors I am making?
>
> Thank you!
>
> Best,
> Christian Casazza
