hi Eli,

This isn't available at the moment, but one could make the internal
buffers in an array accessible in Python. How would you handle nulls
in this scenario (the bytes for a null value in a primitive array can
be any value)? How would one handle things other than numbers?
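
To make the null question concrete, here is a minimal pure-Python sketch
(not pyarrow code; pack_int32_column and null_fill are hypothetical names
made up for illustration). It assumes little-endian int32 values plus a
separate validity bitmap with one bit per slot, least-significant bit
first, which is how Arrow lays out primitive arrays. The value written
into null slots is an arbitrary choice that the receiving database would
have to agree on.

```python
import struct


def pack_int32_column(values, null_fill=0):
    """Pack ints (None = null) into (validity_bitmap, data_bytes).

    Assumptions for this sketch: little-endian int32 data, and a validity
    bitmap with one bit per slot, least-significant bit first (1 = valid).
    """
    n = len(values)
    bitmap = bytearray((n + 7) // 8)  # one bit per slot, rounded up to bytes
    for i, v in enumerate(values):
        if v is not None:
            bitmap[i // 8] |= 1 << (i % 8)
    # Null slots still occupy 4 bytes each; here we choose to write null_fill.
    filled = [null_fill if v is None else v for v in values]
    data = struct.pack("<%di" % n, *filled)
    return bytes(bitmap), data


bitmap, data = pack_int32_column([1, None, 3])
```

Without the bitmap, the consumer cannot tell a null slot from a real 0,
which is why the raw data bytes alone are not enough for nullable columns.
Variable-length types such as strings would additionally need an offsets
buffer, so they do not fit this fixed-width scheme.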

- Wes

On Wed, Jan 31, 2018 at 5:14 AM, Eli <h5r...@protonmail.ch> wrote:
> Hey Wes,
>
>
> What I meant by "standard" is the binary representation of each value of a
> specific type, concatenated together.
>
> The int32 column [1,2,3] would make 
> '\x01\x00\x00\x00\x02\x00\x00\x00\x03\x00\x00\x00' for example.
>
> This is already available via Python's struct.pack(),
> array.array().tobytes() or np.array().astype().tobytes()
>
> What I was wondering is whether that specific representation is already
> there in Arrow's C++ mechanics somewhere, and whether one can get hold of it
> from PyArrow.
>
> I don't know C++ very well, but I think what I'm looking for is in buffer.h, 
> there are pointers to types under Buffer which I think point to just that.
>
> I saw that Buffer is actually accessible via pa.lib.Buffer, and that it even 
> has a to_pybytes() method.
>
> However:
>
> - I'm not sure those are the bytes that I speak of
>
> - I'm not sure how to use Buffer to find out; I keep getting core dumps
> when trying
>
>
>
>
>
> -------- Original Message --------
> On January 10, 2018 7:34 PM, Wes McKinney wrote:
>
>>hi Eli,
>>
>> I am not aware of any standards for binary columns (or at least, I
>> don't know what "regular" means in this context) -- part of the
>> purpose of the Apache Arrow project is to define a columnar standard
>> in the absence of any existing one. Most database systems define their
>> own custom wire protocols.
>>
>> Do you have a link to the specification for the binary protocol for
>> the database you are using (or some other documentation)?
>>
>> Thanks,
>> Wes
>>
>> On Wed, Jan 10, 2018 at 12:47 AM, Eli h5r...@protonmail.ch wrote:
>>>Hey Wes,
>>>The database in question accepts columnar chunks of "regular" binary data 
>>>over the network, one of the sources of which is parquet.
>>>Thus, data only comes out of parquet on my side, and I was wondering how to 
>>>get it out as "regular" binary columns. Something like tobytes() for an 
>>>Arrow Column, or maybe read_asbytes() for pa itself. The purpose is to get 
>>>to standard binary columns as fast as possible.
>>>Thanks,
>>> Eli
>>>>-------- Original Message --------
>>>> Subject: Re: How to get "standard" binary columns out of a pyarrow table
>>>> Local Time: January 10, 2018 5:32 AM
>>>> UTC Time: January 10, 2018 3:32 AM
>>>> From: wesmck...@gmail.com
>>>> To: dev@arrow.apache.org, Eli h5r...@protonmail.ch
>>>>hi Eli,
>>>>I'm wondering what kind of API you would want, if the perfect one
>>>> existed. If I understand correctly, you are embedding objects in a
>>>> BYTE_ARRAY column in Parquet, and need to do some post-processing as
>>>> the data goes in / comes out of Parquet?
>>>>Thanks,
>>>> Wes
>>>>On Sat, Jan 6, 2018 at 8:37 AM, Eli h5r...@protonmail.ch wrote:
>>>>>Hi,
>>>>> I'm looking to send "regular" columnar binary data to a database, the 
>>>>> kind that gets created by struct.pack, array.array, numpy.tobytes or 
>>>>> str.encode.
>>>>> The origin is parquet files, which I'm reading ever so comfortably via 
>>>>> PyArrow.
>>>>> I do however need to deserialize to Python objects, currently via
>>>>> to_pandas(), then re-serialize the columns with one of the above.
>>>>> I was wondering whether there is a better way to go about it, one that
>>>>> would be as fast and efficient as possible.
>>>>> Ideally I'd like to go through Python, but I can do C or even some C++ if 
>>>>> necessary.
>>>>> I posted the question on Stack Overflow, and was asked to post here.
>>>>> I'd appreciate any feedback!
>>>>> Thanks,
>>>>> Eli