You should be able to use s3fs, both via the file handles it creates and as a filesystem for reading multifile datasets:

https://github.com/apache/arrow/blob/master/python/pyarrow/tests/test_parquet.py#L1441
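
For example, something along these lines (bucket, key, and column names here are just placeholders):

import s3fs
import pyarrow.parquet as pq

fs = s3fs.S3FileSystem()

# A single file: pass an s3fs file handle straight to read_table. With the
# columns argument, only the needed byte ranges are fetched from S3.
with fs.open('mybucketfoo/foo.parquet', 'rb') as f:
    table = pq.read_table(f, columns=['col_a', 'col_b'])

# A multifile dataset: pass the s3fs instance as the filesystem.
dataset = pq.ParquetDataset('mybucketfoo/dataset_dir', filesystem=fs)
df = dataset.read(columns=['col_a']).to_pandas()
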
On Fri, Oct 12, 2018 at 12:03 PM Luke <[email protected]> wrote:
>
> It looks like https://github.com/dask/s3fs implements these methods. Would
> there need to be a wrapper over this for arrow, or is it compatible as is?
>
> -Luke
>
> On Fri, Oct 12, 2018 at 9:13 AM Uwe L. Korn <[email protected]> wrote:
>>
>> That looks nice. Once you have wrapped that in a class that implements
>> read and seek like a Python file object, you should be able to pass it to
>> `pyarrow.parquet.read_table`. When you then set the columns argument on
>> that function, only the respective byte ranges are requested from S3. To
>> minimise the number of requests, I would suggest implementing the S3 file
>> so that it requests exactly the ranges given from the outside, but
>> wrapping it in an io.BufferedReader before handing it to pyarrow.
>> pyarrow.parquet requests exactly the ranges it needs, but those requests
>> can sometimes be too fine-grained for object stores like S3, where you
>> usually prefer to fetch a few more bytes in exchange for fewer requests.
>>
>> Uwe
>>
>> On Thu, Oct 11, 2018, at 11:27 PM, Luke wrote:
>>
>> This works in boto3:
>>
>> import boto3
>>
>> obj = boto3.resource('s3').Object('mybucketfoo', 'foo')
>> stream = obj.get(Range='bytes=10-100')['Body']
>> print(stream.read())
>>
>> On Thu, Oct 11, 2018 at 2:22 PM Uwe L. Korn <[email protected]> wrote:
>>
>> Hello Luke,
>>
>> this is only partly implemented. You can do this, and I have already done
>> it, but the state of things is sadly not perfect.
>>
>> boto3 itself seems to be lacking a proper file-like class. You can get
>> the contents of a file in S3 as a
>> https://botocore.amazonaws.com/v1/documentation/api/latest/reference/response.html#botocore.response.StreamingBody
>> but this sadly seems to be missing a seek method.
>>
>> In my case I accessed Parquet files on S3 with per-column access using
>> the simplekv project. There, a small file-like class is implemented on
>> top of boto (but not boto3):
>> https://github.com/mbr/simplekv/blob/master/simplekv/net/botostore.py#L93
>> This is what you are looking for, just built on the wrong boto package.
>> Also, as far as I know, this implementation sadly leaks HTTP connections,
>> so when you access too many files (even serially), your network will
>> suffer.
>>
>> Cheers
>> Uwe
>>
>> On Thu, Oct 11, 2018, at 8:01 PM, Luke wrote:
>>
>> I have Parquet files (each self-contained) in S3 and I want to read
>> certain columns into a pandas DataFrame without reading the entire object
>> out of S3.
>>
>> Is this implemented? boto3 in Python supports reading from offsets in an
>> S3 object, but I wasn't sure whether anyone has made that work with a
>> Parquet file so that only certain columns are read.
>>
>> thanks,
>> Luke
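
For the hand-rolled variant Uwe describes above, a rough sketch of a seekable file over boto3 ranged GETs might look like this (untested; the class name, bucket/key, and buffer size are made up):

import io
import boto3

class S3RawFile(io.RawIOBase):
    """Minimal seekable, read-only file over an S3 object via ranged GETs."""

    def __init__(self, bucket, key):
        self._obj = boto3.resource('s3').Object(bucket, key)
        self._size = self._obj.content_length  # triggers one HEAD request
        self._pos = 0

    def readable(self):
        return True

    def seekable(self):
        return True

    def tell(self):
        return self._pos

    def seek(self, offset, whence=io.SEEK_SET):
        if whence == io.SEEK_SET:
            self._pos = offset
        elif whence == io.SEEK_CUR:
            self._pos += offset
        elif whence == io.SEEK_END:
            self._pos = self._size + offset
        return self._pos

    def readinto(self, b):
        if self._pos >= self._size or len(b) == 0:
            return 0  # EOF or empty destination buffer
        # Fetch exactly the range the caller asked for; the BufferedReader
        # below decides how large that range is.
        end = min(self._pos + len(b), self._size) - 1
        body = self._obj.get(Range='bytes=%d-%d' % (self._pos, end))['Body']
        data = body.read()
        b[:len(data)] = data
        self._pos += len(data)
        return len(data)

# Wrapping in io.BufferedReader turns pyarrow's many small reads into fewer,
# larger S3 requests:
#
#   import pyarrow.parquet as pq
#   f = io.BufferedReader(S3RawFile('mybucketfoo', 'foo.parquet'),
#                         buffer_size=128 * 1024)
#   table = pq.read_table(f, columns=['col_a'])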
