If you want to use pure Python, you should probably just use the s3fs package. We should be able to get better throughput using C++ (using multithreading to issue multiple concurrent requests for larger reads) -- the AWS C++ SDK probably has everything we need to build a really strong implementation.
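For reference, here is a minimal sketch of that pure-Python route: s3fs hands back a file-like object that pyarrow.parquet can read from directly. The bucket and key names are placeholders, and I haven't benchmarked this.

    # Pure-Python path: read a Parquet file from S3 through s3fs.
    # Credentials come from the usual AWS environment/config lookup.
    import s3fs
    import pyarrow.parquet as pq

    fs = s3fs.S3FileSystem()
    # 'my-bucket/path/to/data.parquet' is a placeholder key
    with fs.open('my-bucket/path/to/data.parquet', 'rb') as f:
        table = pq.read_table(f)

    print(table.num_rows)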
Dato/Turi created an S3 file source implementation in C++ (https://github.com/turi-code/SFrame/blob/master/oss_src/fileio/s3_fstream.hpp), which is BSD-licensed and does not depend on the (quite large) AWS C++ SDK, so that might not be a bad place to start.

On Thu, Jun 22, 2017 at 9:01 AM, Colin Nichols <co...@bam-x.com> wrote:

> I am using a pa.PythonFile() wrapping the file-like object provided by the
> s3fs package. I am able to write parquet files directly to S3 this way. I
> am not reading using pyarrow (reading gzipped csvs with python) but I
> imagine it would work much the same.
>
> -- sent from my phone --
>
> > On Jun 22, 2017, at 00:54, Kevin Moore <ke...@quiltdata.io> wrote:
> >
> > Has anyone started looking into how to read data sets from S3? I started
> > looking into it and wondered if anyone has a design in mind.
> >
> > We could implement an S3FileSystem class in pyarrow/filesystem.py. The
> > filesystem components could probably be written against the AWS Python SDK.
> >
> > The HDFS file system and file classes, however, are implemented at least
> > partially in Cython & C++. Is there an advantage to doing that for S3 too?
> >
> > Thanks,
> >
> > Kevin
> >
> > ----
> > Kevin Moore
> > CEO, Quilt Data, Inc.
> > ke...@quiltdata.io | LinkedIn <https://www.linkedin.com/in/kevinemoore/>
> > (415) 497-7895
> >
> >
> > Data packages for fast, reproducible data science
> > quiltdata.com
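For completeness, a rough sketch of the write path Colin describes above -- wrapping the s3fs file object in pa.PythonFile and handing it to the parquet writer. The bucket/key and table contents are placeholders, not a tested snippet.

    # Write a Parquet file directly to S3 by wrapping the s3fs file
    # object so pyarrow treats it as a native output stream.
    import s3fs
    import pyarrow as pa
    import pyarrow.parquet as pq

    fs = s3fs.S3FileSystem()
    table = pa.Table.from_arrays([pa.array([1, 2, 3])], names=['x'])

    # 'my-bucket/output/data.parquet' is a placeholder key
    with fs.open('my-bucket/output/data.parquet', 'wb') as f:
        pq.write_table(table, pa.PythonFile(f, mode='w'))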
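And a very rough sketch of what Kevin's proposed S3FileSystem in pyarrow/filesystem.py might look like if written against boto3 (the AWS Python SDK). The method names here (ls, exists, open) are assumptions loosely modeled on the existing filesystem interface, not a worked-out design.

    # Hypothetical S3FileSystem sketch built on boto3; not pyarrow's API.
    import boto3
    from botocore.exceptions import ClientError


    class S3FileSystem(object):
        """Sketch of an S3-backed filesystem using the AWS Python SDK."""

        def __init__(self, bucket):
            self.bucket = bucket
            self.client = boto3.client('s3')

        def ls(self, prefix=''):
            # List keys under a prefix (single page only, for brevity).
            resp = self.client.list_objects_v2(Bucket=self.bucket,
                                               Prefix=prefix)
            return [obj['Key'] for obj in resp.get('Contents', [])]

        def exists(self, key):
            # HEAD the object; a missing key raises ClientError.
            try:
                self.client.head_object(Bucket=self.bucket, Key=key)
                return True
            except ClientError:
                return False

        def open(self, key):
            # Return a readable stream (botocore StreamingBody supports read()).
            return self.client.get_object(Bucket=self.bucket, Key=key)['Body']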