Re: How to get "standard" binary columns out of a pyarrow table

Wes McKinney Wed, 31 Jan 2018 06:09:28 -0800

hi Eli,

This isn't available at the moment, but one could make the internal
buffers in an array accessible in Python. How would you handle nulls
in this scenario (the bytes for a null value in a primitive array can
be any value)? How would one handle things other than numbers?


- Wes

On Wed, Jan 31, 2018 at 5:14 AM, Eli <h5r...@protonmail.ch> wrote:
> Hey Wes,
>
>
> What I meant by "standard" is the binary representation of each value of a
> specific type, concatenated together.
>
> The int32 column [1,2,3] would make 
> '\x01\x00\x00\x00\x02\x00\x00\x00\x03\x00\x00\x00' for example.
>
> This is already available via Python's struct.pack(), 
> array.array().tostring() or np.array().astype().tobytes()
>
> What I was wondering is whether that specific representation is already
> there in Arrow's C++ mechanics somewhere, and whether one can get hold of it
> from PyArrow.
>
> I don't know C++ very well, but I think what I'm looking for is in buffer.h:
> there are pointers to types under Buffer which I think point to just that.
>
> I saw that Buffer is actually accessible via pa.lib.Buffer, and that it even 
> has a to_pybytes() method.
>
> However:
>
> - I'm not sure those are the bytes that I speak of
>
> - I'm not sure how to use Buffer to find out; I keep getting core dumps when
> trying
>
>
>
> -------- Original Message --------
> On January 10, 2018 7:34 PM, Wes McKinney wrote:
>
>>hi Eli,
>>
>> I am not aware of any standards for binary columns (or at least, I
>> don't know what "regular" means in this context) -- part of the
>> purpose of the Apache Arrow project is to define a columnar standard
>> in the absence of any existing one. Most database systems define their
>> own custom wire protocols.
>>
>> Do you have a link to the specification for the binary protocol for
>> the database you are using (or some other documentation)?
>>
>> Thanks,
>> Wes
>>
>> On Wed, Jan 10, 2018 at 12:47 AM, Eli h5r...@protonmail.ch wrote:
>>> Hey Wes,
>>>
>>> The database in question accepts columnar chunks of "regular" binary data
>>> over the network, one of the sources of which is parquet.
>>>
>>> Thus, data only comes out of parquet on my side, and I was wondering how to
>>> get it out as "regular" binary columns. Something like tobytes() for an
>>> Arrow Column, or maybe read_asbytes() for pa itself. The purpose is to get
>>> to standard binary columns as fast as possible.
>>>
>>> Thanks,
>>> Eli
>>>>-------- Original Message --------
>>>> Subject: Re: How to get "standard" binary columns out of a pyarrow table
>>>> Local Time: January 10, 2018 5:32 AM
>>>> UTC Time: January 10, 2018 3:32 AM
>>>> From: wesmck...@gmail.com
>>>> To: dev@arrow.apache.org, Eli h5r...@protonmail.ch
>>>>hi Eli,
>>>>I'm wondering what kind of API you would want, if the perfect one
>>>> existed. If I understand correctly, you are embedding objects in a
>>>> BYTE_ARRAY column in Parquet, and need to do some post-processing as
>>>> the data goes in / comes out of Parquet?
>>>>Thanks,
>>>> Wes
>>>>On Sat, Jan 6, 2018 at 8:37 AM, Eli h5r...@protonmail.ch wrote:
>>>>>Hi,
>>>>> I'm looking to send "regular" columnar binary data to a database, the 
>>>>> kind that gets created by struct.pack, array.array, numpy.tobytes or 
>>>>> str.encode.
>>>>> The origin is parquet files, which I'm reading ever so comfortably via 
>>>>> PyArrow.
>>>>> I do however need to deserialize to Python objects, currently via
>>>>> to_pandas(), then re-serialize the columns with one of the above.
>>>>> I was wondering whether there was a better way to go about it, one which
>>>>> would be as fast and effective as possible.
>>>>> Ideally I'd like to go through Python, but I can do C or even some C++ if 
>>>>> necessary.
>>>>> I posted the question on stackoverflow, and was asked to post here. 
>>>>> Appreciate any feedback!
>>>>> Thanks,
>>>>> Eli
>>>>>
>>>>
>>>
>>
>
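
The byte layout Eli describes for the int32 column [1, 2, 3] can be checked with the standard library alone. This is only a sketch of the target format, not of anything Arrow does internally. The explicit `<` in the format string pins little-endian order, whereas `array.array("i", ...)` follows the platform's native C int, so the two agree only on little-endian machines with 4-byte ints.

```python
import array
import struct
import sys

values = [1, 2, 3]

# Explicit little-endian layout: 4 bytes per value, no padding.
packed = struct.pack("<3i", *values)

# array.array uses the platform's native C int size and byte order.
native = array.array("i", values)
matches_native = (
    sys.byteorder == "little"
    and native.itemsize == 4
    and native.tobytes() == packed
)

print(packed)
print(matches_native)
```

Note that `array.array.tostring()` was removed in Python 3.9; `tobytes()` is the replacement.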
