You should be able to use s3fs, both via the file handles it creates and as a filesystem for reading multifile datasets:

https://github.com/apache/arrow/blob/master/python/pyarrow/tests/test_parquet.py#L1441
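
For example, something along these lines (bucket, key, and column names here are just placeholders):

import s3fs
import pyarrow.parquet as pq

fs = s3fs.S3FileSystem()

# A single file: pass an s3fs file handle straight to read_table. With the
# columns argument, only the needed byte ranges are fetched from S3.
with fs.open('mybucketfoo/foo.parquet', 'rb') as f:
    table = pq.read_table(f, columns=['col_a', 'col_b'])

# A multifile dataset: pass the s3fs instance as the filesystem.
dataset = pq.ParquetDataset('mybucketfoo/dataset_dir', filesystem=fs)
df = dataset.read(columns=['col_a']).to_pandas()
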
On Fri, Oct 12, 2018 at 12:03 PM Luke <[email protected]> wrote:
>
> It looks like https://github.com/dask/s3fs implements these methods. Would
> there need to be a wrapper over this for arrow, or is it compatible as is?
>
> -Luke
>
> On Fri, Oct 12, 2018 at 9:13 AM Uwe L. Korn <[email protected]> wrote:
>>
>> That looks nice. Once you have wrapped that in a class that implements
>> read and seek like a Python file object, you should be able to pass it to
>> `pyarrow.parquet.read_table`. When you then set the columns argument on
>> that function, only the respective byte ranges are requested from S3. To
>> minimise the number of requests, I would suggest implementing the S3 file
>> so that it requests exactly the ranges given from the outside, but
>> wrapping it in an io.BufferedReader before handing it to pyarrow.
>> pyarrow.parquet requests exactly the ranges it needs, but those requests
>> can sometimes be too fine-grained for object stores like S3, where you
>> usually prefer to fetch a few more bytes in exchange for fewer requests.
>>
>> Uwe
>>
>> On Thu, Oct 11, 2018, at 11:27 PM, Luke wrote:
>>
>> This works in boto3:
>>
>> import boto3
>>
>> obj = boto3.resource('s3').Object('mybucketfoo', 'foo')
>> stream = obj.get(Range='bytes=10-100')['Body']
>> print(stream.read())
>>
>> On Thu, Oct 11, 2018 at 2:22 PM Uwe L. Korn <[email protected]> wrote:
>>
>> Hello Luke,
>>
>> this is only partly implemented. You can do this, and I have already done
>> it, but the state of things is sadly not perfect.
>>
>> boto3 itself seems to be lacking a proper file-like class. You can get
>> the contents of a file in S3 as a
>> https://botocore.amazonaws.com/v1/documentation/api/latest/reference/response.html#botocore.response.StreamingBody
>> but this sadly seems to be missing a seek method.
>>
>> In my case I accessed Parquet files on S3 with per-column access using
>> the simplekv project. There, a small file-like class is implemented on
>> top of boto (but not boto3):
>> https://github.com/mbr/simplekv/blob/master/simplekv/net/botostore.py#L93
>> This is what you are looking for, just built on the wrong boto package.
>> Also, as far as I know, this implementation sadly leaks HTTP connections,
>> so when you access too many files (even serially), your network will
>> suffer.
>>
>> Cheers
>> Uwe
>>
>> On Thu, Oct 11, 2018, at 8:01 PM, Luke wrote:
>>
>> I have Parquet files (each self-contained) in S3 and I want to read
>> certain columns into a pandas DataFrame without reading the entire object
>> out of S3.
>>
>> Is this implemented? boto3 in Python supports reading from offsets in an
>> S3 object, but I wasn't sure whether anyone has made that work with a
>> Parquet file so that only certain columns are read.
>>
>> thanks,
>> Luke
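
For the hand-rolled variant Uwe describes above, a rough sketch of a seekable file over boto3 ranged GETs might look like this (untested; the class name, bucket/key, and buffer size are made up):

import io
import boto3

class S3RawFile(io.RawIOBase):
    """Minimal seekable, read-only file over an S3 object via ranged GETs."""

    def __init__(self, bucket, key):
        self._obj = boto3.resource('s3').Object(bucket, key)
        self._size = self._obj.content_length  # triggers one HEAD request
        self._pos = 0

    def readable(self):
        return True

    def seekable(self):
        return True

    def tell(self):
        return self._pos

    def seek(self, offset, whence=io.SEEK_SET):
        if whence == io.SEEK_SET:
            self._pos = offset
        elif whence == io.SEEK_CUR:
            self._pos += offset
        elif whence == io.SEEK_END:
            self._pos = self._size + offset
        return self._pos

    def readinto(self, b):
        if self._pos >= self._size or len(b) == 0:
            return 0  # EOF or empty destination buffer
        # Fetch exactly the range the caller asked for; the BufferedReader
        # below decides how large that range is.
        end = min(self._pos + len(b), self._size) - 1
        body = self._obj.get(Range='bytes=%d-%d' % (self._pos, end))['Body']
        data = body.read()
        b[:len(data)] = data
        self._pos += len(data)
        return len(data)

# Wrapping in io.BufferedReader turns pyarrow's many small reads into fewer,
# larger S3 requests:
#
#   import pyarrow.parquet as pq
#   f = io.BufferedReader(S3RawFile('mybucketfoo', 'foo.parquet'),
#                         buffer_size=128 * 1024)
#   table = pq.read_table(f, columns=['col_a'])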
