Hi Christian, welcome. Your code looks reasonable to me at first glance. It does seem possible you're resource-constrained with that t2.micro instance: it has only 1 GiB of memory, and a 0.6 GB Parquet file can easily take up more than that once it is decompressed into Arrow. You might try using a larger instance or reducing the batch size in your call to iter_batches [1] to some very small number.
[1] https://arrow.apache.org/docs/python/generated/pyarrow.parquet.ParquetFile.html#pyarrow.parquet.ParquetFile.iter_batches

On Fri, May 10, 2024 at 7:30 AM Christian Casazza <christiancasazz...@gmail.com> wrote:
>
> Hello everyone,
>
> This is my first time emailing this mailing list, so I hope I am explaining
> things correctly below.
>
> I am attempting to get started with Arrow Flight. I am storing parquet
> files and Iceberg tables on S3. I would like to use arrow flight as the
> interface data consumers use to access my data so they always receive Arrow
> back, where they can then continue to iterate locally with DuckDB, polars,
> etc.
>
> I am first attempting to get it working with a single parquet file in a
> private bucket on S3. For this test, I am just putting the credentials and
> paths directly in the server code, after working I can move to env before
> production.
>
> The parquet file is about 0.6GB. I am running the EC2 on a t2.micro
> instance.
>
> I was originally running into an ACCESS_DENIED during HeadObject operation
> AWS error when attempting to get the flight_info metadata about the file.
> From this issue <https://github.com/apache/arrow/issues/37888>, I added in
> using s3fs, and I was able to avoid the HeadObject error. So, the client is
> able to successfully see the available datasets, and return the schema.
>
> When I attempt to actually download the data itself, it is causing my EC2
> instance to break down and my SSH connection to drop. Is this likely a
> memory issue, or something with my code?
>
>
> The goal is to provide users with a common interface to access my data.
> After getting this working, I would add more datasets, data sources,
> introduce auth and RBAC, etc. For now, I thought this was a good base
> starting point. For now, I am just going with the user downloads the entire
> dataset. In the future, I hope to figure out an easy interface to support
> more fine grained data/tablescans, or supporting a query first, to return
> desired data.
>
> To keep things simple, I just added my code here
> <https://github.com/ChristianCasazza/arrowflights3example>.
> When I was actually testing, I connected to the EC2 instance through VScode
> for the server, and I was running the client code locally in a different
> window. I removed my actual parquet file path and credentials.
>
>
> This is my first time working with Arrow Flight, so I apologize if I am
> overlooking something simple or if the answer was in the docs.
>
> Any suggestions for changes I can make to get the data download working, or
> clear errors I am making?
>
> Thank you!
>
> Best,
> Christian Casazza