If you want to use pure Python, you should probably just use the s3fs package. We should be able to get better throughput using C++ (using multithreading to issue multiple concurrent requests for larger reads) -- the AWS C++ SDK probably has everything we need to build a really strong implementation.
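For reference, here is a minimal sketch of that pure-Python route: s3fs hands back a file-like object that pyarrow.parquet can read from directly. The bucket and key names are placeholders, and I haven't benchmarked this.

    # Pure-Python path: read a Parquet file from S3 through s3fs.
    # Credentials come from the usual AWS environment/config lookup.
    import s3fs
    import pyarrow.parquet as pq

    fs = s3fs.S3FileSystem()
    # 'my-bucket/path/to/data.parquet' is a placeholder key
    with fs.open('my-bucket/path/to/data.parquet', 'rb') as f:
        table = pq.read_table(f)

    print(table.num_rows)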
Dato/Turi created an S3 file source implementation in C++ (https://github.com/turi-code/SFrame/blob/master/oss_src/fileio/s3_fstream.hpp), which is BSD-licensed and does not depend on the (quite large) AWS C++ SDK, so that might not be a bad place to start.

On Thu, Jun 22, 2017 at 9:01 AM, Colin Nichols <co...@bam-x.com> wrote:

> I am using a pa.PythonFile() wrapping the file-like object provided by the
> s3fs package. I am able to write parquet files directly to S3 this way. I
> am not reading using pyarrow (reading gzipped csvs with python) but I
> imagine it would work much the same.
>
> -- sent from my phone --
>
> > On Jun 22, 2017, at 00:54, Kevin Moore <ke...@quiltdata.io> wrote:
> >
> > Has anyone started looking into how to read data sets from S3? I started
> > looking into it and wondered if anyone has a design in mind.
> >
> > We could implement an S3FileSystem class in pyarrow/filesystem.py. The
> > filesystem components could probably be written against the AWS Python SDK.
> >
> > The HDFS file system and file classes, however, are implemented at least
> > partially in Cython & C++. Is there an advantage to doing that for S3 too?
> >
> > Thanks,
> >
> > Kevin
> >
> > ----
> > Kevin Moore
> > CEO, Quilt Data, Inc.
> > ke...@quiltdata.io | LinkedIn <https://www.linkedin.com/in/kevinemoore/>
> > (415) 497-7895
> >
> >
> > Data packages for fast, reproducible data science
> > quiltdata.com
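For completeness, a rough sketch of the write path Colin describes above -- wrapping the s3fs file object in pa.PythonFile and handing it to the parquet writer. The bucket/key and table contents are placeholders, not a tested snippet.

    # Write a Parquet file directly to S3 by wrapping the s3fs file
    # object so pyarrow treats it as a native output stream.
    import s3fs
    import pyarrow as pa
    import pyarrow.parquet as pq

    fs = s3fs.S3FileSystem()
    table = pa.Table.from_arrays([pa.array([1, 2, 3])], names=['x'])

    # 'my-bucket/output/data.parquet' is a placeholder key
    with fs.open('my-bucket/output/data.parquet', 'wb') as f:
        pq.write_table(table, pa.PythonFile(f, mode='w'))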
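And a very rough sketch of what Kevin's proposed S3FileSystem in pyarrow/filesystem.py might look like if written against boto3 (the AWS Python SDK). The method names here (ls, exists, open) are assumptions loosely modeled on the existing filesystem interface, not a worked-out design.

    # Hypothetical S3FileSystem sketch built on boto3; not pyarrow's API.
    import boto3
    from botocore.exceptions import ClientError


    class S3FileSystem(object):
        """Sketch of an S3-backed filesystem using the AWS Python SDK."""

        def __init__(self, bucket):
            self.bucket = bucket
            self.client = boto3.client('s3')

        def ls(self, prefix=''):
            # List keys under a prefix (single page only, for brevity).
            resp = self.client.list_objects_v2(Bucket=self.bucket,
                                               Prefix=prefix)
            return [obj['Key'] for obj in resp.get('Contents', [])]

        def exists(self, key):
            # HEAD the object; a missing key raises ClientError.
            try:
                self.client.head_object(Bucket=self.bucket, Key=key)
                return True
            except ClientError:
                return False

        def open(self, key):
            # Return a readable stream (botocore StreamingBody supports read()).
            return self.client.get_object(Bucket=self.bucket, Key=key)['Body']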